Hi,
From what I can tell, that's an error in Ranger, not in Spark, as you can
see by the package where the exception is thrown.
Spark Thrift server in this instance is merely trying to call a Hadoop API,
which then gets hijacked by Ranger.
Your best bet is to look at the case in question, try
Hi,
We solved this the ugly way, when parsing external column definitions:
import org.apache.spark.sql.types._
private def columnTypeToFieldType(columnType: String): DataType = {
  columnType match {
    case "IntegerType" => IntegerType
    case "StringType"  => StringType
    case "DateType"    => DateType
    case "FloatType"   => FloatType
    // (remaining cases truncated in the original message)
  }
}
Hi Gerard, hi List,
I think what this would entail is for Source.commit to change its
functionality. You would need to track all streams' offsets there.
Especially in the socket source, you already have a cache (I haven't looked
at Kafka's implementation too closely yet), so that shouldn't be the
Put your jobs into a parallel collection using .par -- then you can submit
them very easily to Spark, using .foreach. The jobs will then run using the
FIFO scheduler in Spark.
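For illustration, a minimal sketch of this pattern (table names and the
count action are hypothetical):
// Each action submitted from the parallel collection becomes a
// concurrently running Spark job under the FIFO scheduler.
val tables = Seq("table_a", "table_b", "table_c") // hypothetical inputs
tables.par.foreach { t =>
  sqlContext.table(t).count() // one Spark job per element
}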
The advantage over the prior approaches is that you won't have to deal
with threads, and that you can leave parallelism
Keeping it inside the same program/SparkContext is the most performant
solution, since you can avoid serialization and deserialization.
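As a hedged illustration of such reuse within one SparkContext (the path
and column name are hypothetical):
import org.apache.spark.storage.StorageLevel
val df = sqlContext.read.parquet("/path/to/data") // hypothetical input
df.persist(StorageLevel.MEMORY_ONLY) // keep the data in memory between jobs
df.count()                           // first job materializes the cache
df.groupBy("key").count().show()     // later jobs reuse it without re-reading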
In-memory persistence between jobs involves a memcopy, uses a lot of RAM
and invokes serialization and deserialization. Technologies that can help
you do that
I would try to track down the "no space left on device" error - find out
where it originates from, since you should be able to allocate 10 executors
with 4 cores and 15GB RAM each quite easily. In that case, you may want to
increase the memory overhead, so YARN doesn't kill your executors.
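For example, a sketch with placeholder values (the overhead property shown
is the Spark-on-YARN one):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.executor.instances", "10")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "15g")
  .set("spark.yarn.executor.memoryOverhead", "2048") // MB; raise if YARN kills executors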
Check that no local
In Scala you can first define your columns, and then use the
list-to-varargs expander :_* in a select call, something like this:
import org.apache.spark.sql.functions.{col, lit}
val cols = colnames.map(col).map(column => lit(0))
dF.select(cols: _*)
I assume something similar should be possible in Java as well, from
your snippet it's
Hi List,
I'm wondering if the following behaviour should be considered a bug, or
whether it "works as designed":
I'm starting multiple concurrent (FIFO-scheduled) jobs in a single
SparkContext, some of which write into the same tables.
When these tables already exist, it appears as though both
Potentially, with joins, you run out of memory on a single executor,
because a small skew in your data is being amplified. You could try to
increase the default number of partitions, reduce the number of
simultaneous tasks in execution (spark.executor.cores), or add a
repartitioning operation.
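A hedged sketch of those knobs (values, DataFrame names and the join key
are placeholders):
import org.apache.spark.sql.functions.col
sqlContext.setConf("spark.sql.shuffle.partitions", "800") // raise the default of 200
val joined = left
  .repartition(col("join_key")) // explicit repartitioning before the join
  .join(right, "join_key")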
Hi List,
I'm currently trying to naively implement a Data-Vault-type Data-Warehouse
using SparkSQL, and was wondering whether there's an inherent practical
limit to query complexity, beyond which SparkSQL will stop functioning,
even for relatively small amounts of data.
I'm currently looking at
If you have enough RAM/SSDs available, maybe tiered HDFS storage and
Parquet might also be an option. Of course, management-wise it has much
more overhead than using ES, since you need to manually define partitions
and buckets, which is suboptimal. On the other hand, for querying, you can
probably
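For what it's worth, a sketch of defining such partitions and buckets
manually (Spark 2.x writer API; column and table names are hypothetical):
df.write
  .partitionBy("event_date") // directory-level partitions in HDFS
  .bucketBy(32, "user_id")   // bucketing requires saveAsTable
  .sortBy("user_id")
  .format("parquet")
  .saveAsTable("events")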
ars or so".
So the next step is finding out why that's the case, how to actually get to
the point where these features could work in 2 years, and whether they
should work at all
On Tue, Jan 17, 2017 at 6:38 PM, Sean Owen <so...@cloudera.com> wrote:
> On Tue, Jan 17, 2017 at 4:49 PM
Hi List,
I've been following several projects with quite some interest over the past
few years, and I've continued to wonder why they're not moving towards being
supported by mainstream Spark distributions, and why they're not mentioned
more frequently when it comes to enterprise adoption of Spark.
Hi Divya,
I haven't actually used the package yet, but maybe you should check out the
Gitter room, where the creator is quite active. You can find it at
https://gitter.im/FRosner/drunken-data-quality .
There you should be able to get the information you need.
Best,
Rick
On 6 May 2016 12:34,
Something to check (just in case):
Are you getting identical results each time?
On Wed, Nov 4, 2015 at 8:54 AM, gen tang wrote:
> Hi sparkers,
>
> I am using dataframe to do some large ETL jobs.
> More precisely, I create a DataFrame from a Hive table and do some operations.
>
lizers are used and maybe then do
> an analysis.
>
> Best,
> Kartik
>
> On Mon, Sep 28, 2015 at 11:38 AM, Rick Moritz <rah...@gmail.com> wrote:
>
>> Hi Kartik,
>>
>> Thanks for the input!
>>
>> Sadly, that's not it - I'm using YARN - the c
more shuffled data for the same number of shuffled
tuples?
An analysis would be much appreciated.
Best,
Rick
On Wed, Aug 19, 2015 at 2:47 PM, Rick Moritz <rah...@gmail.com> wrote:
> oops, forgot to reply-all on this thread.
>
> -- Forwarded message --
> From
A quick question regarding this: how come the artifacts (spark-core in
particular) on Maven Central are built with JDK 1.6 (according to the
manifest), if Java 7 is required?
On Aug 21, 2015 5:32 PM, Sean Owen so...@cloudera.com wrote:
Spark 1.4 requires Java 7.
On Fri, Aug 21, 2015, 3:12 PM
7. Or some later repackaging process ran on the
artifacts and used Java 6. I do see Build-Jdk: 1.6.0_45 in the
manifest, but I don't think 1.4.x can compile with Java 6.
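For illustration, one way to read that manifest attribute directly (the
jar path is a placeholder):
import java.util.jar.JarFile
val manifest = new JarFile("spark-core_2.10-1.4.0.jar").getManifest
println(manifest.getMainAttributes.getValue("Build-Jdk")) // e.g. 1.6.0_45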
On Tue, Aug 25, 2015 at 9:59 PM, Rick Moritz rah...@gmail.com wrote:
A quick question regarding this: how come the artifacts
oops, forgot to reply-all on this thread.
-- Forwarded message --
From: Rick Moritz rah...@gmail.com
Date: Wed, Aug 19, 2015 at 2:46 PM
Subject: Re: Strange shuffle behaviour difference between Zeppelin and
Spark-shell
To: Igor Berman igor.ber...@gmail.com
Those values
?
On 19 August 2015 at 09:49, Rick Moritz rah...@gmail.com wrote:
Dear list,
I am observing a very strange difference in behaviour between a Spark
1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 Zeppelin
interpreter (compiled with Java 6 and sourced from Maven Central
-submit it
using different spark-binaries to further explore the issue.
Best Regards,
Rick Moritz
PS: I already tried to send this mail yesterday, but it never made it onto
the list, as far as I can tell -- I apologize should anyone receive this as
a second copy.
Consider the spark.cores.max configuration option -- it should do what you
require.
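For instance, a minimal sketch (the value is a placeholder):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.cores.max", "16") // caps the total cores the application may acquire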
On Tue, Aug 11, 2015 at 8:26 AM, Haripriya Ayyalasomayajula
aharipriy...@gmail.com wrote:
Hello all,
As a quick follow-up to this: I have been using Spark on YARN till now
and am currently exploring Mesos
Dear List,
I'm trying to reference a lonely message to this list from March 25th,(
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Maven-Test-error-td22216.html
), but I'm unsure whether this will thread properly. Sorry if it didn't work out.
Anyway, using Spark 1.4.0-RC4 I run into the same