Re: Question about Google Books Ngrams with pyspark (1.4.1)
Looking at another forum, I tried:

>>> files = sc.newAPIHadoopFile(
...     "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram",
...     "com.hadoop.mapreduce.LzoTextInputFormat",
...     "org.apache.hadoop.io.LongWritable",
...     "org.apache.hadoop.io.Text")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/context.py", line 574, in newAPIHadoopFile
    jconf, batchSize)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: java.lang.ClassNotFoundException: com.hadoop.mapreduce.LzoTextInputFormat

Thanks for your help,

Bertrand

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Google-Books-Ngrams-with-pyspark-1-4-1-tp24542p24556.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
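A note on the failure above: a java.lang.ClassNotFoundException for com.hadoop.mapreduce.LzoTextInputFormat generally means the hadoop-lzo jar is not on Spark's classpath. A minimal sketch of one way to supply it when starting pyspark; the jar path is an assumption and depends on where hadoop-lzo is installed:

```shell
# Assumed location of the hadoop-lzo jar -- adjust to your installation.
HADOOP_LZO_JAR=/usr/lib/hadoop/lib/hadoop-lzo.jar

# --jars ships the jar to the executors; --driver-class-path makes the
# input-format class visible to the driver JVM as well.
pyspark --jars "$HADOOP_LZO_JAR" --driver-class-path "$HADOOP_LZO_JAR"
```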
Re: Question about Google Books Ngrams with pyspark (1.4.1)
Hello everybody,

I followed the steps from https://issues.apache.org/jira/browse/SPARK-2394 to read LZO-compressed files, but now I cannot even open a file with:

>>> lines = sc.textFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram")
>>> lines.first()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/rdd.py", line 1295, in first
    rs = self.take(1)
  File "/root/spark/python/pyspark/rdd.py", line 1247, in take
    totalParts = self.getNumPartitions()
  File "/root/spark/python/pyspark/rdd.py", line 355, in getNumPartitions
    return self._jrdd.partitions().size()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
: java.lang.RuntimeException: Error in configuring object

>>> lines = sc.sequenceFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/context.py", line 544, in sequenceFile
    keyConverter, valueConverter, minSplits, batchSize)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.sequenceFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.31.12.23): java.lang.IllegalArgumentException: Unknown codec: com.hadoop.compression.lzo.LzoCodec

Thanks for your help,

Cheers,

Bertrand

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Google-Books-Ngrams-with-pyspark-1-4-1-tp24542p24546.html
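The "Unknown codec: com.hadoop.compression.lzo.LzoCodec" failure above usually means the codec class is not registered with Hadoop's compression framework on the worker nodes. A sketch of the usual core-site.xml entries, assuming hadoop-lzo is installed (the exact file location varies by distribution):

```xml
<!-- core-site.xml: register the LZO codecs so Hadoop can resolve
     com.hadoop.compression.lzo.LzoCodec when opening the SequenceFile. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

This configuration has to be visible on every node (or in the Hadoop configuration used by the SparkContext) for sc.sequenceFile to succeed.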
Re: Question about Google Books Ngrams with pyspark (1.4.1)
Thanks for your prompt reply.

I will follow https://issues.apache.org/jira/browse/SPARK-2394 and will let you know if everything works.

Cheers,

Bertrand

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Google-Books-Ngrams-with-pyspark-1-4-1-tp24542p24545.html
Re: Question about Google Books Ngrams with pyspark (1.4.1)
Do you have LZO configured? See http://stackoverflow.com/questions/14808041/how-to-have-lzo-compression-in-hadoop-mapreduce

---
Robin East
Spark GraphX in Action
Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/malak/

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Google-Books-Ngrams-with-pyspark-1-4-1-tp24542p24544.html
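Besides registering the codec, the native LZO library has to be loadable by the driver and executor JVMs, or the codec can fail at configuration time. A sketch of the relevant spark-defaults.conf entries; the native-library path is an assumption and varies by installation:

```properties
# spark-defaults.conf: make the native LZO libraries visible to the JVMs.
spark.driver.extraLibraryPath    /usr/lib/hadoop/lib/native
spark.executor.extraLibraryPath  /usr/lib/hadoop/lib/native
```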
Question about Google Books Ngrams with pyspark (1.4.1)
Hello everybody,

I am trying to read the Google Books Ngrams with pyspark on Amazon EC2. I followed the steps from http://spark.apache.org/docs/latest/ec2-scripts.html and everything is working fine.

I am able to read the file:

>>> lines = sc.textFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram")
>>> lines.first()
u'SEQ\x06!org.apache.hadoop.io.LongWritable\x19org.apache.hadoop.io.Text\x01\x01#com.hadoop.compression.lzo.LzoCodec \x00\x00\x00\x00\ufffd

If I now want to read the file using:

>>> lines = sc.sequenceFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram")

I have the following error message:

15/09/01 15:28:51 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 172.31.61.41): java.lang.IllegalArgumentException: Unknown codec: com.hadoop.compression.lzo.LzoCodec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/context.py", line 544, in sequenceFile
    keyConverter, valueConverter, minSplits, batchSize)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.sequenceFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 172.31.61.41): java.lang.IllegalArgumentException: Unknown codec: com.hadoop.compression.lzo.LzoCodec

Could you please help me read the file with pyspark?

Thank you for your help,

Cheers,

Bertrand

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Google-Books-Ngrams-with-pyspark-1-4-1-tp24542.html
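The first line that textFile() returned is not random garbage: it is the raw header of a Hadoop SequenceFile (the b'SEQ' magic, a version byte, then length-prefixed class names for the key, value, and compression codec), which is why a plain-text reader cannot make sense of the file. A small standalone sketch, not from the original thread, that decodes that header just to illustrate what the bytes mean; the parser is simplified and assumes each string length fits in a single byte, which holds for these short class names:

```python
# The header bytes that lines.first() printed, reconstructed from the
# output quoted above (the trailing sync bytes are omitted).
header = (b"SEQ\x06"
          b"\x21org.apache.hadoop.io.LongWritable"    # 0x21 = 33-byte name
          b"\x19org.apache.hadoop.io.Text"            # 0x19 = 25-byte name
          b"\x01"                                     # values are compressed
          b"\x01"                                     # block compression on
          b"\x23com.hadoop.compression.lzo.LzoCodec") # 0x23 = 35-byte name

def parse_sequencefile_header(buf):
    """Decode the fixed prefix of a SequenceFile header (simplified)."""
    if buf[:3] != b"SEQ":
        raise ValueError("not a SequenceFile")
    version = buf[3]
    pos = 4

    def read_string(pos):
        # Single-byte length prefix, then that many ASCII bytes.
        length = buf[pos]
        value = buf[pos + 1:pos + 1 + length].decode("ascii")
        return value, pos + 1 + length

    key_class, pos = read_string(pos)
    value_class, pos = read_string(pos)
    compressed, block_compressed = buf[pos] == 1, buf[pos + 1] == 1
    pos += 2
    codec_class, _ = read_string(pos)
    return version, key_class, value_class, codec_class

print(parse_sequencefile_header(header))
# (6, 'org.apache.hadoop.io.LongWritable', 'org.apache.hadoop.io.Text',
#  'com.hadoop.compression.lzo.LzoCodec')
```

The decoded codec name is exactly the class the "Unknown codec" error complains about, which points at the LZO setup rather than at the read call itself.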