Nicholas Chammas created SPARK-2394:
---------------------------------------

             Summary: Make it easier to read LZO-compressed files from EC2 clusters
                 Key: SPARK-2394
                 URL: https://issues.apache.org/jira/browse/SPARK-2394
             Project: Spark
          Issue Type: Improvement
          Components: EC2, Input/Output
    Affects Versions: 1.0.0
            Reporter: Nicholas Chammas
            Priority: Minor


Amazon hosts [a large Google n-grams data set on S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is perfect, among other things, for putting together interesting and easily reproducible public demos of Spark's capabilities.

The problem is that the data set is compressed using LZO, and it is currently more painful than it should be to get your average {{spark-ec2}} cluster to read input compressed in this way.

This is what one has to go through to get a Spark cluster created with {{spark-ec2}} to read LZO-compressed files:
# Install the latest LZO release, perhaps via {{yum}}.
# Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it. To build {{hadoop-lzo}} you need Maven.
# Install Maven. For some reason, [you cannot install Maven with {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum], so install it manually.
# Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
# Make [the appropriate calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E] to {{sc.newAPIHadoopFile}}.

This seems like a bit too much work for what we're trying to accomplish. If we expect this to be a common pattern -- reading LZO-compressed files from a {{spark-ec2}} cluster -- it would be great if we could somehow make this less painful.
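
For illustration, the final {{sc.newAPIHadoopFile}} step might look roughly like the following in PySpark. This is only a minimal sketch and not necessarily what the linked thread recommends. It assumes a PySpark release that exposes {{SparkContext.newAPIHadoopFile}}, and that the {{hadoop-lzo}} jar and native LZO libraries are already installed and configured on the cluster. The helper name {{read_lzo_text}} is hypothetical and introduced here only for the example; the class names are the ones shipped by {{hadoop-lzo}} and Hadoop.

```python
# Class names provided by hadoop-lzo and by Hadoop itself.
LZO_INPUT_FORMAT = "com.hadoop.mapreduce.LzoTextInputFormat"
KEY_CLASS = "org.apache.hadoop.io.LongWritable"
VALUE_CLASS = "org.apache.hadoop.io.Text"


def read_lzo_text(sc, path):
    """Return an RDD of text lines read from LZO-compressed files at `path`.

    `sc` is an existing SparkContext. This helper is hypothetical; it only
    wraps the newAPIHadoopFile call and assumes the cluster can already
    load the hadoop-lzo classes.
    """
    # Each record arrives as a (byte offset, line) pair.
    pairs = sc.newAPIHadoopFile(path, LZO_INPUT_FORMAT, KEY_CLASS, VALUE_CLASS)
    # Drop the offset and keep only the line text.
    return pairs.map(lambda kv: kv[1])
```

With a running context, usage would be along the lines of {{lines = read_lzo_text(sc, path)}}, where {{path}} points at the LZO-compressed input. The call fails at job time if the {{hadoop-lzo}} classes are not on the classpath, which is exactly the setup burden this issue is about.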