Wrong temp directory when compressing before sending text file to S3

2014-11-06 Thread Gary Malouf
We have some data that we are exporting from our HDFS cluster to S3 with some help from Spark. The final RDD command we run is: csvData.saveAsTextFile("s3n://data/mess/2014/11/dump-oct-30-to-nov-5-gzip", classOf[GzipCodec]). We have our 'spark.local.dir' set to our large ephemeral partition on
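
For readers who want to see the quoted call in context, here is a minimal sketch, not the poster's actual job. The input path, the local directory path, and the object name are hypothetical; only the output path and the codec come from the message. It assumes the master URL and S3 credentials are supplied externally (for example by spark-submit and the Hadoop configuration).

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}

object GzipExportSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("gzip-export-sketch")
      // Hypothetical path: stands in for the "large ephemeral partition"
      // mentioned in the message.
      .set("spark.local.dir", "/mnt/ephemeral/spark")

    val sc = new SparkContext(conf)

    // Hypothetical input: the message does not say how csvData is built.
    val csvData = sc.textFile("hdfs:///path/to/csv/input")

    // The call quoted in the message: write the RDD as gzip-compressed
    // text part files under the given S3 prefix.
    csvData.saveAsTextFile(
      "s3n://data/mess/2014/11/dump-oct-30-to-nov-5-gzip",
      classOf[GzipCodec])

    sc.stop()
  }
}
```

The sketch only shows where 'spark.local.dir' and the codec are set; it does not reproduce or explain the temp-directory behavior the thread title describes.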

Parquet Migrations

2014-10-31 Thread Gary Malouf
Outside of what is discussed here https://issues.apache.org/jira/browse/SPARK-3851 as a future solution, is there any path for being able to modify a Parquet schema once some data has been written? This seems like the kind of thing that should make people pause when considering whether or not to

Re: Parquet schema migrations

2014-10-24 Thread Gary Malouf
Hi Michael, Does this affect people who use Hive for their metadata store as well? I'm wondering if the issue is as bad as I think it is - namely that if you build up a year's worth of data, adding a field forces you to have to migrate that entire year's data. Gary On Wed, Oct 8, 2014 at 5:08

Re: guava version conflicts

2014-09-22 Thread Gary Malouf
Hi Marcelo, Interested to hear the approach to be taken. Shading guava itself seems extreme, but that might make sense. Gary On Sat, Sep 20, 2014 at 9:38 PM, Marcelo Vanzin van...@cloudera.com wrote: Hmm, looks like the hack to maintain backwards compatibility in the Java API didn't work

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Gary Malouf
I'm kind of surprised this was not run into before. Do people not segregate their data by day/week in the HDFS directory structure? On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust mich...@databricks.com wrote: Thanks! On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger c...@koeninger.org

CoHadoop Papers

2014-08-26 Thread Gary Malouf
One of my colleagues has been questioning me as to why Spark/HDFS makes no attempts to try to co-locate related data blocks. He pointed to this paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the CoHadoop research and the performance improvements it yielded for Map/Reduce

Re: CoHadoop Papers

2014-08-26 Thread Gary Malouf
It appears support for this type of control over block placement is going out in the next version of HDFS: https://issues.apache.org/jira/browse/HDFS-2576 On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com wrote: One of my colleagues has been questioning me as to why Spark/HDFS

Re: CoHadoop Papers

2014-08-26 Thread Gary Malouf
. On that note, Hadoop does also make attempts to collocate data, e.g., rack awareness. I'm sure the paper makes useful contributions for its set of use cases. Sent while mobile. Pls excuse typos etc. On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote: It appears support for this type

Re: CoHadoop Papers

2014-08-26 Thread Gary Malouf
partitioning and the need to shuffle in the Spark SQL planner. On Tue, Aug 26, 2014 at 8:37 AM, Gary Malouf malouf.g...@gmail.com wrote: Christopher, can you expand on the co-partitioning support? We have a number of spark SQL tables (saved in parquet format) that all could be considered to have

Re: Mesos/Spark Deadlock

2014-08-25 Thread Gary Malouf
/pull/1860 into Spark 1.1. Incidentally have you tried that? Matei On August 23, 2014 at 4:30:27 PM, Gary Malouf (malouf.g...@gmail.com) wrote: Hi Matei, We have an analytics team that uses the cluster on a daily basis. They use two types of 'run modes': 1) For running actual

Mesos/Spark Deadlock

2014-08-23 Thread Gary Malouf
I just wanted to bring up a significant Mesos/Spark issue that makes the combo difficult to use for teams larger than 4-5 people. It's covered in https://issues.apache.org/jira/browse/MESOS-1688. My understanding is that Spark's use of executors in fine-grained mode is a very different behavior

Re: Mesos/Spark Deadlock

2014-08-23 Thread Gary Malouf
can use Mesos in coarse-grained mode by setting spark.mesos.coarse=true. Then it will hold onto CPUs for the duration of the job. Matei On August 23, 2014 at 7:57:30 AM, Gary Malouf (malouf.g...@gmail.com) wrote: I just wanted to bring up a significant Mesos/Spark issue that makes
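
As a rough illustration of the setting named in this reply, the property can be set on the application's SparkConf. This is a sketch only: the application name and the Mesos master URL are hypothetical placeholders, and only spark.mesos.coarse=true comes from the message.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoarseGrainedSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("coarse-grained-sketch")
      // Hypothetical master URL; replace with the real Mesos master.
      .setMaster("mesos://mesos-master.example.com:5050")
      // The setting from the reply: run on Mesos in coarse-grained mode,
      // so the job holds onto its CPUs for its whole duration.
      .set("spark.mesos.coarse", "true")

    val sc = new SparkContext(conf)
    try {
      // Trivial job, just to exercise the context.
      println(sc.parallelize(1 to 100).sum())
    } finally {
      sc.stop()
    }
  }
}
```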

Re: [SPARK-3050] Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster

2014-08-14 Thread Gary Malouf
To be clear, is it 'compiled' against 1.0.2 or is it packaged with it? On Thu, Aug 14, 2014 at 6:39 PM, Mingyu Kim m...@palantir.com wrote: I ran a really simple code that runs with Spark 1.0.2 jar and connects to a Spark 1.0.1 cluster, but it fails with java.io.InvalidClassException. I filed

Re: replacement for SPARK_JAVA_OPTS

2014-08-07 Thread Gary Malouf
Can this be cherry-picked for 1.1 if everything works out? In my opinion, it could be qualified as a bug fix. On Thu, Aug 7, 2014 at 5:47 PM, Marcelo Vanzin van...@cloudera.com wrote: Andrew has been working on a fix: https://github.com/apache/spark/pull/1770 On Thu, Aug 7, 2014 at 2:35

Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going well. Looking at the Mesos slave logs, I noticed: ERROR KryoSerializer: Failed to run spark.kryo.registrator java.lang.ClassNotFoundException: com/mediacrossing/verrazano/kryo/MxDataRegistrator My spark-env.sh has the
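
For context on the spark.kryo.registrator property named in the error, a registrator is typically wired up along the following lines. This is a generic sketch, not the poster's configuration: the Event record type and the ExampleRegistrator class are hypothetical, and the message's own spark-env.sh settings are cut off in this preview.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical record type standing in for the application's own classes.
case class Event(id: Long, name: String)

// Hypothetical registrator: registers application classes with Kryo.
class ExampleRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Event])
  }
}

object KryoConfSketch {
  // spark.kryo.registrator takes the registrator's fully qualified class
  // name; getName yields the dot-separated form.
  val conf: SparkConf = new SparkConf()
    .setAppName("kryo-conf-sketch")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", classOf[ExampleRegistrator].getName)
}
```

The registrator class is loaded on the executors as well as the driver, so the jar containing it has to be available to both.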

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Gary Malouf
We use the Hadoop configuration inside of our code executing on Spark as we need to list out files in the path. Maybe that is why it is exposed for us. On Mon, Jul 14, 2014 at 6:57 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nishkam, Aaron's fix should prevent two concurrent accesses

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Gary Malouf
We'll try to run a build tomorrow AM. On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell pwend...@gmail.com wrote: Andrew and Gary, Would you guys be able to test https://github.com/apache/spark/pull/1409/files and see if it solves your problem? - Patrick On Mon, Jul 14, 2014 at 4:18 PM,

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Gary Malouf
information about the scope and severity. Could you fork another thread for this? - Patrick On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf malouf.g...@gmail.com wrote: -1 I honestly do not know the voting rules for the Spark community, so please excuse me if I am out of line or if Mesos

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-10 Thread Gary Malouf
-1 I honestly do not know the voting rules for the Spark community, so please excuse me if I am out of line or if Mesos compatibility is not a concern at this point. We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos 0.18.2. All of our jobs with data above a few gigabytes

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-10 Thread Gary Malouf
Just realized the deadline was Monday, my apologies. The issue nevertheless stands. On Thu, Jul 10, 2014 at 9:28 PM, Gary Malouf malouf.g...@gmail.com wrote: -1 I honestly do not know the voting rules for the Spark community, so please excuse me if I am out of line or if Mesos compatibility

Re: Spark on Scala 2.11

2014-05-10 Thread Gary Malouf
Considering the team just bumped to 2.10 in 0.9, I would be surprised if this is a near term priority. On Thu, May 8, 2014 at 9:33 PM, Anand Avati av...@gluster.org wrote: Is there an ongoing effort (or intent) to support Spark on Scala 2.11? Approximate timeline? Thanks