IPv6 regression in Spark 1.5.1

2015-10-14 Thread Thomas Dudziak
It looks like Spark 1.5.1 does not work with IPv6. When adding -Djava.net.preferIPv6Addresses=true on my dual-stack server, the driver fails with: 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext. java.lang.AssertionError: assertion failed: Expected hostname at
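
The assertion suggests the hostname check treats any colon as a host:port separator, which raw IPv6 literals trip over. A minimal sketch of that failure mode (illustrative only, not Spark's actual source):

    // Illustrative only: a host check of this shape rejects raw IPv6 literals,
    // because addresses such as "2001:db8::1" are full of ':' characters.
    def checkHost(host: String, message: String = ""): Unit =
      assert(host.indexOf(':') == -1, s"Expected hostname: $message")

    checkHost("host.example.com")   // passes
    // checkHost("2001:db8::1")     // would throw java.lang.AssertionError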

Re: IPv6 regression in Spark 1.5.1

2015-10-14 Thread Thomas Dudziak
ng(hostPort).hasPort, message) } On Wed, Oct 14, 2015 at 2:40 PM, Thomas Dudziak <tom...@gmail.com> wrote: > It looks like Spark 1.5.1 does not work with IPv6. When > adding -Djava.net.preferIPv6Addresses=true on my dual stack server, the > driver fails with: > > 15

Yahoo's Caffe-on-Spark project

2015-09-29 Thread Thomas Dudziak
http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop I would be curious to learn what the Spark developers' plans are in this area (NNs, GPUs) and what they think of integration with existing NN frameworks like Caffe or Torch. cheers, Tom

Accumulator with non-java-serializable value ?

2015-09-09 Thread Thomas Dudziak
I want to use t-digest with foreachPartition and accumulators (essentially, create a t-digest per partition and add that to the accumulator, leveraging the fact that t-digests can be added to each other). I can make t-digests Kryo-serializable easily, but making them Java-serializable is not very easy. Now,
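
A sketch of one workaround with the Spark 1.x accumulator API: wrap the digest in a Serializable holder that round-trips through the library's binary encoding, so the accumulator value survives Java serialization even though the digest itself does not. The t-digest method names used here (byteSize/asBytes/fromBytes, createAvlTreeDigest, add(other)) are assumptions about that library and may differ between versions:

    import java.nio.ByteBuffer
    import com.tdunning.math.stats.{AVLTreeDigest, TDigest}
    import org.apache.spark.AccumulatorParam

    // Hypothetical wrapper: serializes the digest via its binary encoding.
    class SerializableTDigest(@transient var digest: TDigest) extends Serializable {
      private def writeObject(out: java.io.ObjectOutputStream): Unit = {
        val buf = ByteBuffer.allocate(digest.byteSize())
        digest.asBytes(buf)
        out.writeInt(buf.position())
        out.write(buf.array(), 0, buf.position())
      }
      private def readObject(in: java.io.ObjectInputStream): Unit = {
        val bytes = new Array[Byte](in.readInt())
        in.readFully(bytes)
        digest = AVLTreeDigest.fromBytes(ByteBuffer.wrap(bytes))
      }
    }

    // Merge logic for the Spark 1.x AccumulatorParam API.
    object TDigestAccumulatorParam extends AccumulatorParam[SerializableTDigest] {
      def zero(v: SerializableTDigest) =
        new SerializableTDigest(TDigest.createAvlTreeDigest(100.0))
      def addInPlace(a: SerializableTDigest, b: SerializableTDigest) = {
        a.digest.add(b.digest)  // assumes the library supports merging digests
        a
      }
    }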

Re: How to avoid shuffle errors for a large join ?

2015-09-01 Thread Thomas Dudziak
a lot of > garbage, making it slower. SMJ performance is probably 5x - 1000x better in > 1.5 for your case. > > > On Thu, Aug 27, 2015 at 6:03 PM, Thomas Dudziak <tom...@gmail.com> wrote: > >> I'm getting errors like "Removing executor with no recent heartbeats"
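
For context, SMJ here is sort-merge join. A quick way to confirm which physical join the planner actually picked (table and column names below are placeholders; assumes an existing sqlContext):

    // Inspect the physical plan; look for SortMergeJoin vs. a hash join.
    val joined = sqlContext.table("big_a").join(sqlContext.table("big_b"), Seq("key"))
    joined.explain()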

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Thomas Dudziak
the answer was to raise spark.sql.shuffle.partitions beyond the suggested 1000. In my case, 16k partitions worked for me, but your tables look a little denser, so you may want to go even higher. On Thu, Aug 27, 2015 at 6:04 PM Thomas Dudziak tom...@gmail.com wrote: I'm getting errors like Removing
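
For reference, a sketch of how that setting is raised (16384 mirrors the 16k mentioned above; the right value depends on your data):

    // Raise the number of shuffle partitions used by Spark SQL joins/aggregations.
    sqlContext.setConf("spark.sql.shuffle.partitions", "16384")
    // or at submit time: --conf spark.sql.shuffle.partitions=16384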

How to avoid shuffle errors for a large join ?

2015-08-27 Thread Thomas Dudziak
I'm getting "Removing executor with no recent heartbeats" and "Missing an output location for shuffle" errors for a large Spark SQL join (1bn rows/2.5TB joined with 1bn rows/30GB) and I'm not sure how to configure the job to avoid them. The initial stage completes fine with some 30k tasks on
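
A hedged sketch of knobs commonly adjusted when executors get dropped for missed heartbeats during large shuffles; the values are illustrative and were not prescribed in this thread:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.network.timeout", "600s")          // raises heartbeat-related timeouts
      .set("spark.sql.shuffle.partitions", "16384")  // smaller, more numerous shuffle partitions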

Re: Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
...@gmail.com wrote: Have you tried TABLESAMPLE? You find the exact syntax in the documentation, but it does exactly what you want. On Wed, Aug 26, 2015 at 18:12, Thomas Dudziak tom...@gmail.com wrote: Sorry, I meant without reading from all splits. This is a single partition in the table
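
A sketch of what that suggestion looks like from a HiveContext (table name and bucket count are placeholders; bucket sampling only skips data if the table is actually bucketed on the sampled expression, otherwise it still scans):

    // Hive TABLESAMPLE issued through Spark's HiveContext.
    val sampled = hiveContext.sql(
      "SELECT * FROM my_table TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) s")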

Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from and I don't particularly care which rows. Doing a LIMIT unfortunately results in two stages where the first stage reads the whole table, and the second then performs the limit with a single worker, which is not very
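
One obvious alternative is a fraction-based sample on the DataFrame side, sketched below with placeholder names; note that it still reads every split, which is exactly the cost the question is trying to avoid, but it does skip the single-worker LIMIT stage:

    // Roughly 10% sample of the table, evaluated in parallel across partitions.
    val approx100m = hiveContext.table("my_table")
      .sample(withReplacement = false, fraction = 0.1)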

Re: Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
Sorry, I meant without reading from all splits. This is a single partition in the table. On Wed, Aug 26, 2015 at 8:53 AM, Thomas Dudziak tom...@gmail.com wrote: I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from and I don't particularly care which rows. Doing a LIMIT

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Thomas Dudziak
-grained scheduler, there is a spark.cores.max config setting that will limit the total # of cores it grabs. This was there in earlier versions too. Matei On May 19, 2015, at 12:39 PM, Thomas Dudziak tom...@gmail.com wrote: I read the other day that there will be a fair number of improvements
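
For reference, a minimal sketch of the setting Matei points to (the value is illustrative):

    // Caps the total number of cores a Spark-on-Mesos application will acquire.
    val conf = new org.apache.spark.SparkConf().set("spark.cores.max", "64")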

Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Thomas Dudziak
I read the other day that there will be a fair number of improvements in 1.4 for Mesos. Could I ask for one more (if it isn't already in there): a configurable limit for the number of tasks for jobs run on Mesos? This would be a very simple yet effective way to prevent a job dominating the

Exception when using CLUSTER BY or ORDER BY

2015-05-19 Thread Thomas Dudziak
Under certain circumstances that I haven't yet been able to isolate, I get the following error when doing an HQL query using HiveContext (Spark 1.3.1 on Mesos, fine-grained mode). Is this a known problem or should I file a JIRA for it? org.apache.spark.SparkException: Can only zip RDDs with same
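
A sketch of the shape of query involved, with hypothetical table and column names; the "Can only zip RDDs" failure came from inside Spark rather than from anything unusual in the HQL itself:

    // HQL using CLUSTER BY through a HiveContext (Spark 1.3.x era API).
    val result = hiveContext.sql("SELECT key, value FROM my_table CLUSTER BY key")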

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Thomas Dudziak
This is still a problem in 1.3. Optional is both used in several shaded classes within Guava (e.g. the Immutable* classes) and itself uses shaded classes (e.g. AbstractIterator). This causes problems in application code. The only reliable way we've found around this is to shade Guava ourselves for
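
A hedged sketch of the "shade Guava ourselves" workaround using sbt-assembly's relocation rules (requires the sbt-assembly plugin; the target package name is arbitrary):

    // In build.sbt: relocate Guava classes inside the application jar so they
    // cannot collide with the partially shaded Guava on Spark's classpath.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.google.common.**" -> "myapp.shaded.guava.@1").inAll
    )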

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Thomas Dudziak
Actually the extraClassPath settings put the extra jars at the end of the classpath so they won't help. Only the deprecated SPARK_CLASSPATH puts them at the front. cheers, Tom On Fri, May 15, 2015 at 11:54 AM, Marcelo Vanzin van...@cloudera.com wrote: Ah, I see. yeah, it sucks that Spark has

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Thomas Dudziak
I've just been through this exact case with shaded Guava in our Mesos setup and that is how it behaves there (with Spark 1.3.1). cheers, Tom On Fri, May 15, 2015 at 12:04 PM, Marcelo Vanzin van...@cloudera.com wrote: On Fri, May 15, 2015 at 11:56 AM, Thomas Dudziak tom...@gmail.com wrote