Nicholas Chammas created SPARK-2394:
---------------------------------------

             Summary: Make it easier to read LZO-compressed files from EC2 clusters
                 Key: SPARK-2394
                 URL: https://issues.apache.org/jira/browse/SPARK-2394
             Project: Spark
          Issue Type: Improvement
          Components: EC2, Input/Output
    Affects Versions: 1.0.0
            Reporter: Nicholas Chammas
            Priority: Minor


Amazon hosts [a large Google n-grams data set on 
S3|https://aws.amazon.com/datasets/8172056142375670]. Among other things, this 
data set is well suited to building interesting, easily reproducible public 
demos of Spark's capabilities.

The problem is that the data set is compressed using LZO, and getting a 
typical {{spark-ec2}} cluster to read LZO-compressed input currently takes 
more effort than it should.

This is what one has to go through to get a Spark cluster created with 
{{spark-ec2}} to read LZO-compressed files:
# Install the latest LZO release, perhaps via {{yum}}.
# Install Maven, which is needed to build {{hadoop-lzo}}. For some reason, [you cannot install Maven with 
{{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum], 
so install it manually.
# Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it with Maven.
# Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate 
configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
# Make [the appropriate 
calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E]
 to {{sc.newAPIHadoopFile}}.
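
To make the last two steps concrete, here is a rough sketch in Python. It is not taken from the linked threads. The helper names ({{render_core_site_properties}}, {{read_lzo_text}}) are hypothetical, the codec property names and classes are the ones I recall from the {{hadoop-lzo}} README, and the read helper assumes a PySpark release that exposes {{newAPIHadoopFile}}, LZO-compressed plain-text input, and the {{hadoop-lzo}} jar and native libraries already on the cluster's classpath and library path.

```python
from xml.etree import ElementTree as ET

# Assumption: property names and codec classes as documented in the
# hadoop-lzo README; the settings in the linked thread may differ.
LZO_CORE_SITE_PROPERTIES = {
    "io.compression.codecs": ",".join([
        "org.apache.hadoop.io.compress.GzipCodec",
        "org.apache.hadoop.io.compress.DefaultCodec",
        "org.apache.hadoop.io.compress.BZip2Codec",
        "com.hadoop.compression.lzo.LzoCodec",
        "com.hadoop.compression.lzo.LzopCodec",
    ]),
    "io.compression.codec.lzo.class": "com.hadoop.compression.lzo.LzoCodec",
}


def render_core_site_properties(properties):
    """Return <property> elements as XML text, to paste into core-site.xml."""
    chunks = []
    for name, value in properties.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        chunks.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(chunks)


def read_lzo_text(sc, path):
    """Read LZO-compressed text files as an RDD of lines.

    `sc` is an existing SparkContext (hypothetical helper, not a Spark API).
    """
    pairs = sc.newAPIHadoopFile(
        path,
        "com.hadoop.mapreduce.LzoTextInputFormat",  # from hadoop-lzo
        "org.apache.hadoop.io.LongWritable",        # key: byte offset
        "org.apache.hadoop.io.Text",                # value: line contents
    )
    # Drop the byte-offset keys and keep only the line text.
    return pairs.map(lambda kv: kv[1])
```

With a running cluster this would be used roughly as {{lines = read_lzo_text(sc, path)}}, and {{print(render_core_site_properties(LZO_CORE_SITE_PROPERTIES))}} prints the fragment to place inside the {{<configuration>}} element of {{core-site.xml}}. The {{spark-env.sh}} changes are not covered here.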

This seems like a bit too much work for what we're trying to accomplish.

If we expect this to be a common pattern -- reading LZO-compressed files from a 
{{spark-ec2}} cluster -- it would be great if we could somehow make this less 
painful.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
