We have some data that we are exporting from our HDFS cluster to S3 with
some help from Spark. The final RDD command we run is:
csvData.saveAsTextFile("s3n://data/mess/2014/11/dump-oct-30-to-nov-5-gzip",
classOf[GzipCodec])
We have our 'spark.local.dir' set to our large ephemeral partition on
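For reference, 'spark.local.dir' can be set on the SparkConf before the context is created; a minimal sketch, assuming a hypothetical mount point /mnt/ephemeral for the ephemeral partition:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Point Spark's scratch space at the large ephemeral disk.
// ("/mnt/ephemeral/spark" is a hypothetical path, not from the thread.)
val conf = new SparkConf()
  .setAppName("s3-export")
  .set("spark.local.dir", "/mnt/ephemeral/spark")
val sc = new SparkContext(conf)
```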
Outside of what is discussed here
https://issues.apache.org/jira/browse/SPARK-3851 as a future solution, is
there any path for being able to modify a Parquet schema once some data has
been written? This seems like the kind of thing that should make people
pause when considering whether or not to
Hi Michael,
Does this affect people who use Hive as their metastore as well? I'm
wondering if the issue is as bad as I think it is - namely that if you
build up a year's worth of data, adding a field forces you to migrate
that entire year's data.
Gary
On Wed, Oct 8, 2014 at 5:08
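For what it's worth, later Spark releases added Parquet schema merging, which lets files written before a column existed read that column as null instead of forcing a rewrite; a sketch, assuming the Spark 1.3+ DataFrame API and an in-scope `sqlContext` (the path is hypothetical):

```scala
// Read a year's worth of Parquet data whose schema evolved over time.
// With mergeSchema enabled, old files simply return null for columns
// that were added later -- no migration of existing data required.
val df = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/data/events/2014")
```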
Hi Marcelo,
Interested to hear the approach to be taken. Shading guava itself seems
extreme, but that might make sense.
Gary
On Sat, Sep 20, 2014 at 9:38 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hmm, looks like the hack to maintain backwards compatibility in the
Java API didn't work
I'm kind of surprised this was not run into before. Do people not
segregate their data by day/week in the HDFS directory structure?
On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust mich...@databricks.com
wrote:
Thanks!
On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger c...@koeninger.org
One of my colleagues has been asking me why Spark/HDFS makes no
attempt to co-locate related data blocks. He pointed to this
paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the
CoHadoop research and the performance improvements it yielded for
Map/Reduce.
It appears support for this type of control over block placement is going
out in the next version of HDFS:
https://issues.apache.org/jira/browse/HDFS-2576
On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com wrote:
One of my colleagues has been questioning me as to why Spark/HDFS
On that note, Hadoop does already make some attempt to co-locate data,
e.g., rack awareness. I'm
sure the paper makes useful contributions for its set of use cases.
Sent while mobile. Pls excuse typos etc.
On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote:
It appears support for this type
partitioning and the need to
shuffle in the Spark SQL planner.
On Tue, Aug 26, 2014 at 8:37 AM, Gary Malouf malouf.g...@gmail.com
wrote:
Christopher, can you expand on the co-partitioning support?
We have a number of Spark SQL tables (saved in Parquet format) that all
could be considered to have
/pull/1860 into Spark 1.1. Incidentally
have you tried that?
Matei
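In core Spark, co-partitioning can be arranged today by giving both sides of a join the same partitioner, which lets the join run without a shuffle; a minimal sketch, where `rddA` and `rddB` are hypothetical pair RDDs:

```scala
import org.apache.spark.HashPartitioner

// Hash-partition both sides identically so the join needs no shuffle.
val part = new HashPartitioner(64)
val left  = rddA.partitionBy(part)  // rddA, rddB: hypothetical (K, V) RDDs
val right = rddB.partitionBy(part)
val joined = left.join(right)       // co-partitioned: shuffle-free join
```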
On August 23, 2014 at 4:30:27 PM, Gary Malouf (malouf.g...@gmail.com)
wrote:
Hi Matei,
We have an analytics team that uses the cluster on a daily basis. They
use two types of 'run modes':
1) For running actual
I just wanted to bring up a significant Mesos/Spark issue that makes the
combo difficult to use for teams larger than 4-5 people. It's covered in
https://issues.apache.org/jira/browse/MESOS-1688. My understanding is that
Spark's use of executors in fine-grained mode is a very different behavior
can use Mesos in
coarse-grained mode by setting spark.mesos.coarse=true. Then it will hold
onto CPUs for the duration of the job.
Matei
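The setting Matei mentions goes on the SparkConf (or spark-defaults); a minimal sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Coarse-grained mode: Spark acquires Mesos CPUs once and holds them
// for the duration of the job, rather than launching fine-grained tasks.
val conf = new SparkConf().set("spark.mesos.coarse", "true")
val sc = new SparkContext(conf)
```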
On August 23, 2014 at 7:57:30 AM, Gary Malouf (malouf.g...@gmail.com)
wrote:
I just wanted to bring up a significant Mesos/Spark issue that makes
To be clear, is it 'compiled' against 1.0.2 or is it packaged with it?
On Thu, Aug 14, 2014 at 6:39 PM, Mingyu Kim m...@palantir.com wrote:
I ran a really simple code that runs with Spark 1.0.2 jar and connects to
a Spark 1.0.1 cluster, but it fails with java.io.InvalidClassException. I
filed
Can this be cherry-picked for 1.1 if everything works out? In my opinion,
it could qualify as a bug fix.
On Thu, Aug 7, 2014 at 5:47 PM, Marcelo Vanzin van...@cloudera.com wrote:
Andrew has been working on a fix:
https://github.com/apache/spark/pull/1770
On Thu, Aug 7, 2014 at 2:35
After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going
well. Looking at the Mesos slave logs, I noticed:
ERROR KryoSerializer: Failed to run spark.kryo.registrator
java.lang.ClassNotFoundException:
com/mediacrossing/verrazano/kryo/MxDataRegistrator
My spark-env.sh has the
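For context, the class named in the stack trace is a KryoRegistrator, which must be on every executor's classpath (the jar containing it has to be shipped to the slaves, not just the driver); a sketch of what such a registrator looks like, with `MyRecord` as a hypothetical data class standing in for the real registered types:

```scala
package com.mediacrossing.verrazano.kryo

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

case class MyRecord(id: Long, name: String) // hypothetical data class

// A ClassNotFoundException for this class on the slaves means the jar
// containing it never reached the executors' classpath.
class MxDataRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyRecord])
  }
}
```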
We use the Hadoop configuration inside our code executing on Spark, as we
need to list out files in the path. Maybe that is why it is exposed for us.
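Listing files through Spark's Hadoop configuration looks roughly like this (a sketch; `sc` is the SparkContext and the path is hypothetical):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse Spark's Hadoop configuration to enumerate files under a path.
val fs = FileSystem.get(sc.hadoopConfiguration)
val files = fs.listStatus(new Path("/data/input"))
  .map(_.getPath.toString)
```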
On Mon, Jul 14, 2014 at 6:57 PM, Patrick Wendell pwend...@gmail.com wrote:
Hey Nishkam,
Aaron's fix should prevent two concurrent accesses
We'll try to run a build tomorrow AM.
On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell pwend...@gmail.com wrote:
Andrew and Gary,
Would you guys be able to test
https://github.com/apache/spark/pull/1409/files and see if it solves
your problem?
- Patrick
On Mon, Jul 14, 2014 at 4:18 PM,
information about the scope and severity. Could you fork another
thread for this?
- Patrick
On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf malouf.g...@gmail.com
wrote:
-1 I honestly do not know the voting rules for the Spark community, so
please excuse me if I am out of line or if Mesos compatibility is not a
concern at this point.
We just tried to run this version built against 2.3.0-cdh5.0.2 on Mesos
0.18.2. All of our jobs with data above a few gigabytes
Just realized the deadline was Monday, my apologies. The issue
nevertheless stands.
On Thu, Jul 10, 2014 at 9:28 PM, Gary Malouf malouf.g...@gmail.com wrote:
-1 I honestly do not know the voting rules for the Spark community, so
please excuse me if I am out of line or if Mesos compatibility
Considering the team just bumped to 2.10 in 0.9, I would be surprised if
this is a near term priority.
On Thu, May 8, 2014 at 9:33 PM, Anand Avati av...@gluster.org wrote:
Is there an ongoing effort (or intent) to support Spark on Scala 2.11?
Approximate timeline?
Thanks