We are hitting the same issue on Spark 1.6.1 with Tungsten enabled, Kryo
enabled & sort-based shuffle.
Did you find a resolution?
On Sat, Apr 9, 2016 at 6:31 AM, Ted Yu wrote:
> Not much.
>
> So no chance of different snappy version ?
>
> On Fri, Apr 8, 2016 at 1:26 PM,
Hi all,
Doing some simple column transformations (e.g. trimming strings) on a
DataFrame using UDFs. This DataFrame is in Avro format and being loaded off
HDFS. The job has about 16,000 parts/tasks.
About half way through, the job fails with the message:
org.apache.spark.SparkException: Job
an array of doubles with 3 fields: the prediction,
the class A probability and the class B probability. How could I turn those
into 3 columns from my expression? Clearly .withColumn only expects 1
column back.
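One pattern that might help here (a rough sketch only; scoreUdf, the feature column and the probability column names are hypothetical, not from this thread): return the array once, then split it out with getItem.

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._   // assumes the shell's SQLContext named sqlContext

// Hypothetical scoring UDF that packs all three outputs into one array column.
val scoreUdf = udf { (feature: Double) =>
  Array(1.0, 0.7, 0.3)   // placeholder for prediction, P(class A), P(class B)
}

val scored = df
  .withColumn("scores", scoreUdf($"feature"))      // single array column
  .withColumn("prediction", $"scores".getItem(0))  // then one column per element
  .withColumn("probA", $"scores".getItem(1))
  .withColumn("probB", $"scores".getItem(2))
  .drop("scores")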
On Tue, Sep 8, 2015 at 6:21 PM, Night Wolf <nightwolf...@gmail.com> wrote:
> Sorry for
ely a string where parameters are Comma-separated...
>
> Le lun. 7 sept. 2015 à 8:35, Night Wolf <nightwolf...@gmail.com> a écrit :
>
>> Is it possible to have a UDF which takes a variable number of arguments?
>>
>> e.g. df.select(myUdf($"*")) fails with
&
t 5:47 PM, Night Wolf <nightwolf...@gmail.com> wrote:
> So basically I need something like
>
> df.withColumn("score", new Column(new Expression {
> ...
>
> def eval(input: Row = null): EvaluatedType = myModel.score(input)
> ...
>
> }))
>
> But
ue or some struct...
On Tue, Sep 8, 2015 at 5:33 PM, Night Wolf <nightwolf...@gmail.com> wrote:
> Not sure how that would work. Really I want to tack on an extra column
> onto the DF with a UDF that can take a Row object.
>
> On Tue, Sep 8, 2015 at 1:54 AM, Jörn Franke <jornfr
Is it possible to have a UDF which takes a variable number of arguments?
e.g. df.select(myUdf($"*")) fails with
org.apache.spark.sql.AnalysisException: unresolved operator 'Project
[scalaUDF(*) AS scalaUDF(*)#26];
What I would like to do is pass in a generic data frame which can then be
passed
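For what it's worth, one workaround people use (just a sketch, and I haven't verified it against this exact Spark version) is to wrap all the columns into a single struct and have the UDF take a Row:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// Hypothetical UDF that sees the whole row as one struct argument.
val myUdf = udf { (row: Row) =>
  row.length   // e.g. do something with every field
}

// struct() takes a varargs list of columns, so this works for any schema.
val result = df.select(myUdf(struct(df.columns.map(col): _*)).as("out"))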
Hey all,
I'm trying to do some stuff with a YAML file in the Spark driver using the
SnakeYAML library in Scala.
When I put the snakeyaml v1.14 jar on the SPARK_DIST_CLASSPATH and try to
de-serialize some objects from YAML into classes in my app JAR on the
driver (only the driver), I get the
Hi guys,
I'm trying to do a cross join (cartesian product) with 3 tables stored as
parquet. Each table has 1 column, a long key.
Table A has 60,000 keys with 1000 partitions
Table B has 1000 keys with 1 partition
Table C has 4 keys with 1 partition
The output should be 240 million rows
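In case it helps anyone finding this later, the cartesian product over the three tables can be written as joins with no join condition (a sketch; the paths are made up):

// A join with no condition is a cartesian product, so chaining two of them
// gives the 3-way cross join: 60,000 * 1,000 * 4 = 240 million rows.
val a = sqlContext.read.parquet("/path/to/table_a")
val b = sqlContext.read.parquet("/path/to/table_b")
val c = sqlContext.read.parquet("/path/to/table_c")
val crossed = a.join(b).join(c)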
How far did you get?
On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
We use Scoobi + MR to perform joins and we particularly use blockJoin()
API of scoobi
/** Perform an equijoin with another distributed list where this list is
considerably smaller
* than the
: Running task 11093.0 in stage 0.0
(TID 9552)
15/06/16 13:43:22 INFO executor.CoarseGrainedExecutorBackend: Got assigned
task 9553
15/06/16 13:43:22 INFO executor.Executor: Running task 10323.1 in stage 0.0
(TID 9553)
On Tue, Jun 16, 2015 at 1:47 PM, Night Wolf nightwolf...@gmail.com wrote:
Hi guys
Hi guys,
Using Spark 1.4, trying to save a dataframe as a table, a really simple
test, but I'm getting a bunch of NPEs;
The code I'm running is very simple:
qc.read.parquet("/user/sparkuser/data/staged/item_sales_basket_id.parquet").write.format("parquet").saveAsTable("is_20150617_test2")
Logs of
tasks or regular tasks (the first attempt of the task)? Is this error
deterministic (can you reproduce every time you run this command)?
Thanks,
Yin
On Mon, Jun 15, 2015 at 8:59 PM, Night Wolf nightwolf...@gmail.com
wrote:
Looking at the logs of the executor, looks like it fails to find
?
spark.sql.hive.metastore.sharedPrefixes
com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
https://issues.apache.org/jira/browse/SPARK-7819 has more context about
it.
On Wed, Jun 3, 2015 at 9:38 PM, Night Wolf nightwolf
Hi all,
Trying out Spark 1.4 RC4 on MapR4/Hadoop 2.5.1 running in yarn-client mode with
Hive support.
*Build command;*
./make-distribution.sh --name mapr4.0.2_yarn_j6_2.10 --tgz -Pyarn -Pmapr4
-Phadoop-2.4 -Pmapr4 -Phive -Phadoop-provided
-Dhadoop.version=2.5.1-mapr-1501
: Opening proxy
: qtausc-pphd0177.hadoop.local:40237
15/06/03 10:34:31 INFO impl.AMRMClientImpl: Received new token for :
qtausc-pphd0132.hadoop.local:44108
15/06/03 10:34:31 INFO yarn.YarnAllocator: Received 1 containers from YARN,
launching executors on 0 of them.
On Wed, Jun 3, 2015 at 10:29 AM, Night
1.3 and 1.4; it also has been
working fine for me.
Are you sure you're using exactly the same Hadoop libraries (since you're
building with -Phadoop-provided) and Hadoop configuration in both cases?
On Tue, Jun 2, 2015 at 5:29 PM, Night Wolf nightwolf...@gmail.com wrote:
Hi all,
Trying out
Hi all,
I have a job that, for every row, creates about 20 new objects (i.e. RDD of
100 rows in = RDD 2000 rows out). The reason for this is each row is tagged
with a list of the 'buckets' or 'windows' it belongs to.
The actual data is about 10 billion rows. Each executor has 60GB of memory.
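To make the expansion concrete, here is roughly what that tagging step looks like (purely illustrative; the Event type and the windowing rule are hypothetical):

import org.apache.spark.rdd.RDD

case class Event(id: Long, ts: Long)

// Hypothetical rule: each event belongs to the 20 most recent 1-minute windows,
// so every input row fans out into ~20 output rows.
def windowsFor(e: Event): Seq[Long] = {
  val w = e.ts / 60000L
  (w - 19L) to w
}

def tag(events: RDD[Event]): RDD[(Long, Event)] =
  events.flatMap(e => windowsFor(e).map(win => (win, e)))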
Hi guys,
If I load a dataframe via a sql context that has a SORT BY in the query and
I want to repartition the data frame will it keep the sort order in each
partition?
I want to repartition because I'm going to run a Map that generates lots of
data internally so to avoid Out Of Memory errors I
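Not a definitive answer, but one option worth trying is to push both the repartitioning and the per-partition ordering into the query itself with DISTRIBUTE BY plus SORT BY (a sketch; the table and column names are made up):

// DISTRIBUTE BY controls which partition each row lands in, and SORT BY
// re-sorts the rows inside each partition after that shuffle.
val df = sqlContext.sql(
  "SELECT * FROM my_table DISTRIBUTE BY bucket_id SORT BY bucket_id, event_time")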
I'm seeing a similar thing with a slightly different stack trace. Ideas?
org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
Seeing similar issues, did you find a solution? One would be to increase
the number of partitions if you're doing lots of object creation.
On Thu, Feb 12, 2015 at 7:26 PM, fightf...@163.com fightf...@163.com
wrote:
Hi, patrick
Really glad to get your reply.
Yes, we are doing group by
? I was experimenting with the Row class in
Python and apparently partitionBy automatically takes the first column as the key.
However, I am not sure how you can access a part of an object without
deserializing it (either explicitly or Spark doing it for you)
On Wed, May 6, 2015 at 7:14 PM, Night Wolf
Hi,
If I have an RDD[MyClass] and I want to partition it by the hash code of
MyClass for performance reasons, is there any way to do this without
converting it into a PairRDD RDD[(K,V)] and calling partitionBy???
Mapping it to a tuple2 seems like a waste of space/computation.
It looks like the
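As far as I know partitionBy only exists on pair RDDs, so the usual workaround is still to key the RDD temporarily (a sketch; rdd and numPartitions are placeholders for your RDD[MyClass] and partition count):

import org.apache.spark.HashPartitioner

// Key by the hash code, partition on that key, then drop the key again.
val partitioned = rdd
  .keyBy(_.hashCode)
  .partitionBy(new HashPartitioner(numPartitions))
  .values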
Thanks Andrew. What version of HS2 is the SparkSQL thrift server using?
What would be involved in updating? Is it a simple case of bumping the
dependency version in one of the project POMs?
Cheers,
~N
On Sat, May 2, 2015 at 11:38 AM, Andrew Lee alee...@hotmail.com wrote:
Hi N,
See:
Hi guys,
Trying to use the SparkSQL Thriftserver with the Hive metastore. It seems that
Hive metastore impersonation works fine (when running Hive tasks). However,
when spinning up the SparkSQL thrift server, impersonation doesn't seem to work...
What settings do I need to enable impersonation?
I've copied the
Hi guys,
Having a problem building a DataFrame in Spark SQL from a JDBC data source
when running with --master yarn-client and adding the JDBC driver JAR with
--jars. If I run with a local[*] master all works fine.
./bin/spark-shell --jars /tmp/libs/mysql-jdbc.jar --master yarn-client
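One thing that might be worth trying (an unverified suggestion, not a confirmed fix): JDBC drivers often need to be visible to the driver JVM at launch, not just shipped with --jars, and the executors may similarly need the jar on spark.executor.extraClassPath. For example:

./bin/spark-shell --master yarn-client \
  --jars /tmp/libs/mysql-jdbc.jar \
  --driver-class-path /tmp/libs/mysql-jdbc.jar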
cluster, into a common
location.
On Thu, Apr 23, 2015 at 6:38 PM, Night Wolf nightwolf...@gmail.com
wrote:
Hi guys,
Having a problem building a DataFrame in Spark SQL from a JDBC data source
when
running with --master yarn-client and adding the JDBC driver JAR with
--jars. If I run
Hey,
Trying to build Spark 1.3 with Scala 2.11 supporting yarn hive (with
thrift server).
Running;
*mvn -e -DskipTests -Pscala-2.11 -Dscala-2.11 -Pyarn -Pmapr4 -Phive
-Phive-thriftserver clean install*
The build fails with;
INFO] Compiling 9 Scala sources to
Was a solution ever found for this? Trying to run some test cases with sbt
test which use Spark SQL, and in the Spark 1.3.0 release with Scala 2.11.6 I get
this error. Setting fork := true in sbt seems to work, but it's a less than
ideal workaround.
On Tue, Mar 17, 2015 at 9:37 PM, Eric Charles
Tried with that. No luck. Same error on the sbt-interface jar. I can see Maven
downloaded that jar into my .m2 cache
On Friday, March 6, 2015, 鹰 980548...@qq.com wrote:
try it with mvn -DskipTests -Pscala-2.11 clean install package
Hey guys,
Trying to build Spark 1.3 for Scala 2.11.
I'm running with the following Maven command:
-DskipTests -Dscala-2.11 clean install package
*Exception*:
[ERROR] Failed to execute goal on project spark-core_2.10: Could not
resolve dependencies for project
Hey,
Trying to build latest spark 1.3 with Maven using
-DskipTests clean install package
But I'm getting errors with zinc, in the logs I see;
[INFO]
*--- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
spark-network-common_2.11 --- *
...
[error] Required file not found:
wrote:
Shark's in-memory code was ported to Spark SQL and is used by default
when you run .cache on a SchemaRDD or CACHE TABLE.
I'd also look at parquet which is more efficient and handles nested data
better.
On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf nightwolf...@gmail.com
Hi all,
I'd like to build/use column oriented RDDs in some of my Spark code. A
normal Spark RDD is stored as row-oriented objects, if I understand
correctly.
I'd like to leverage some of the advantages of a columnar memory format.
Shark (used to) and SparkSQL uses a columnar storage format using
Hi,
I just built Spark 1.3 master using maven via make-distribution.sh;
./make-distribution.sh --name mapr3 --skip-java-test --tgz -Pmapr3 -Phive
-Phive-thriftserver -Phive-0.12.0
When trying to start the standalone spark master on a cluster I get the
following stack trace;
15/02/04 08:53:56
In Spark SQL we have Row objects which contain a list of fields that make
up a row. A Row has ordinal accessors such as .getInt(0) or getString(2).
Say ordinal 0 = ID and ordinal 1 = Name. It becomes hard to remember what
ordinal is what, making the code confusing.
Say for example I have the
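One small thing that might keep this readable (a sketch with made-up names, assuming rows holds the Row objects): give the ordinals names once, or map each Row into a case class up front.

// Made-up column layout: ordinal 0 = ID, ordinal 1 = Name.
object PersonCols {
  val Id = 0
  val Name = 1
}

case class Person(id: Int, name: String)

val people = rows.map(r => Person(r.getInt(PersonCols.Id), r.getString(PersonCols.Name)))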
Hi all,
I'd like to leverage some of the fast Spark collection implementations in
my own code.
Particularly for doing things like distinct counts in a mapPartitions
loop.
Are there any plans to make the org.apache.spark.util.collection
implementations public? Is there any other library out
it
from a source under test as Intellij won't provide the provided scope
libraries when running code in main source (but it will for sources under
test).
With this config you can run sbt assembly in order to get the fat jar
without the Spark jars.
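For reference, the relevant build.sbt line is typically something along these lines (the version shown is only illustrative): marking Spark as provided keeps it out of the assembly jar while still compiling against it.

// Spark is already on the cluster, so exclude it from the fat jar built by sbt assembly.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"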
On Tue, Jan 13, 2015 at 12:16 PM, Night Wolf
Hi,
I'm trying to load up an SBT project in IntelliJ 14 (windows) running 1.7
JDK, SBT 0.13.5 - I seem to be getting errors with the project.
The build.sbt file is super simple;
name := "scala-spark-test1"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %%
Hi,
Just to give some context. We are using Hive metastore with csv Parquet
files as a part of our ETL pipeline. We query these with SparkSQL to do
some downstream work.
I'm curious what's the best way to go about testing Hive SparkSQL? I'm
using 1.1.0
I see that the LocalHiveContext has been
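Not sure what the current recommendation is, but a rough sketch of one common approach (all names illustrative, not a definitive recipe): spin up a HiveContext on a local SparkContext inside the test suite.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Local SparkContext + HiveContext for the duration of the test suite.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("hive-sql-test"))
val hiveContext = new HiveContext(sc)
// In the suite setup, point hive.metastore.warehouse.dir (and the embedded
// Derby metastore) at temporary directories so test runs stay isolated.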