Hello,

I have been trying to play with the Google ngram dataset provided by Amazon in the form of LZO-compressed files.

I am having trouble understanding what is going on ;). I added the compression jar and native library to the underlying Hadoop/HDFS installation and restarted the namenode and the datanodes. Spark can clearly see the file, but I get gibberish on a read. Any ideas?

See output below:

14/07/13 14:39:19 INFO SparkContext: Added JAR file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar with timestamp 1405262359777
14/07/13 14:39:20 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called with curMem=0, maxMem=311387750
14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 160.0 KB, free 296.8 MB)
f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> f.take(10)
14/07/13 14:39:43 INFO SparkContext: Job finished: take at <console>:15, took 0.419708348 s
res0: Array[String] = Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec????���\<N�#^�??d^�k�������\<N�#^�??d^�k��3��??�3???�?????? ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????�?????�?�m??��??hx??????????�??�???�??�??�??�??�??�? �?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �? �? �? �? �?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?\�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?�?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...
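One thing I notice (just an observation, not a confirmed diagnosis): the gibberish starts with `SEQ` followed by `LongWritable`/`Text` class names, which is what the magic header of a Hadoop SequenceFile looks like, rather than plain LZO-compressed text. A minimal sketch of checking a file's leading bytes for that header (the object and method names here are hypothetical, not from Hadoop's API):

```scala
// Sketch: a Hadoop SequenceFile begins with the 3 magic bytes 'S','E','Q'
// followed by a one-byte version number, then the key and value class names.
// If a file starts this way, sc.textFile will show its raw header as garbage.
object SeqFileCheck {
  def looksLikeSequenceFile(header: Array[Byte]): Boolean =
    header.length >= 3 &&
      header(0) == 'S'.toByte &&
      header(1) == 'E'.toByte &&
      header(2) == 'Q'.toByte
}
```

If that is what the file is, `textFile` would be the wrong entry point regardless of whether the LZO codec is wired up correctly, since it treats the input as line-oriented text.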

Thanks!
Ognen
