[ https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375766#comment-14375766 ]
Theodore Vasiloudis commented on SPARK-2394:
--------------------------------------------

Just adding some more info here for people who end up here through searches.

Steps 1-3 can be completed by running this script on each machine in your cluster: https://gist.github.com/thvasilo/7696d21cb3205f5cb11d

There should be an easy way to execute this script when the cluster is being launched; I tried using the --user-data flag, but that doesn't seem to do it. Otherwise you'd have to rsync this script to each machine (easy: use ~/spark-ec2/copy-dir after you've copied the file to your master) and then ssh into each machine and run it (not so easy).

For Step 4, make sure that core-site.xml is changed both in the hadoop config and in the spark-conf/ directory. Also, as suggested in the hadoop-lzo docs:

{quote}
Note that there seems to be a bug in /path/to/hadoop/bin/hadoop; comment out the line:
{code}
JAVA_LIBRARY_PATH=''
{code}
{quote}

Here's how I set the vars in spark-env.sh:

{code}
export SPARK_SUBMIT_LIBRARY_PATH="$SPARK_SUBMIT_LIBRARY_PATH:/root/persistent-hdfs/lib/native/:/root/hadoop-native:/root/hadoop-lzo/target/native/Linux-amd64-64/lib:/usr/lib64/"
export SPARK_SUBMIT_CLASSPATH="$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar"
{code}

And what I added to both copies of core-site.xml:

{code:xml}
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
{code}

As for the code (Step 5) itself, I've tried the different variations suggested on the mailing list and elsewhere, and ended up using the following: https://gist.github.com/thvasilo/cd99709eacb44c8a8cff

Note that this uses the sequenceFile reader, specifically for the Google Ngrams. Setting minPartitions is important in order to get good parallelization for whatever you do with the data later on (3 * the number of cores in your cluster seems like a good value).

You can run the above job using:

{code}
./bin/spark-submit --jars local:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar --class your.package.here.TestNgrams --master $SPARK_MASTER $SPARK_JAR dummy_arg
{code}

You should of course set the env variables for your Spark master and the location of your fat jar. Note that I'm passing the hadoop-lzo jar as local:, which assumes that every node has built the jar; that is done by the script given above.

Do the above and you should get the count and the first line of the data when running the job.

> Make it easier to read LZO-compressed files from EC2 clusters
> -------------------------------------------------------------
>
>                 Key: SPARK-2394
>                 URL: https://issues.apache.org/jira/browse/SPARK-2394
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2, Input/Output
>    Affects Versions: 1.0.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>              Labels: compression
>
> Amazon hosts [a large Google n-grams data set on S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is perfect, among other things, for putting together interesting and easily reproducible public demos of Spark's capabilities.
> The problem is that the data set is compressed using LZO, and it is currently more painful than it should be to get your average {{spark-ec2}} cluster to read input compressed in this way.
> This is what one has to go through to get a Spark cluster created with {{spark-ec2}} to read LZO-compressed files:
> # Install the latest LZO release, perhaps via {{yum}}.
> # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it. To build {{hadoop-lzo}} you need Maven.
> # Install Maven.
For some reason, [you cannot install Maven with {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum], so install it manually.
> # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
> # Make [the appropriate calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E] to {{sc.newAPIHadoopFile}}.
> This seems like a bit too much work for what we're trying to accomplish.
> If we expect this to be a common pattern -- reading LZO-compressed files from a {{spark-ec2}} cluster -- it would be great if we could somehow make this less painful.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
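For readers who don't want to chase the gists, here is a rough sketch of what a Step 5 driver along the lines described above can look like. This is an illustration, not the code from the gist: the package name matches the hypothetical --class flag in the spark-submit command, and the S3 path and key/value types are assumptions based on how the public Google Ngrams dataset is stored (LZO-compressed SequenceFiles with {{LongWritable}} keys and {{Text}} values); adjust them to your data. It needs a running cluster with the hadoop-lzo setup from the earlier steps.

{code}
package your.package.here // hypothetical; must match the --class flag above

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object TestNgrams {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TestNgrams"))

    // Assumed path for the public 1-gram SequenceFiles; swap in your own data.
    val path = "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data"

    // minPartitions of roughly 3 * total cores gives decent downstream
    // parallelism; defaultParallelism is used here as a stand-in for the
    // cluster's core count.
    val minPartitions = 3 * sc.defaultParallelism

    // Read the SequenceFile and copy the Text values out to Strings
    // immediately, since Hadoop reuses the Writable objects.
    val ngrams = sc.sequenceFile(path, classOf[LongWritable], classOf[Text], minPartitions)
                   .map { case (_, line) => line.toString }

    println(s"Count: ${ngrams.count()}")
    println(s"First line: ${ngrams.first()}")

    sc.stop()
  }
}
{code}

For plain LZO-compressed text files (rather than SequenceFiles), the {{sc.newAPIHadoopFile}} call from the issue description would instead go through hadoop-lzo's input format, along the lines of {{sc.newAPIHadoopFile(path, classOf[com.hadoop.mapreduce.LzoTextInputFormat], classOf[LongWritable], classOf[Text])}}.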