Re: DataFrame job fails on parsing error, help?

2016-04-28 Thread Night Wolf
We are hitting the same issue on Spark 1.6.1 with Tungsten enabled, Kryo enabled and sort-based shuffle. Did you find a resolution? On Sat, Apr 9, 2016 at 6:31 AM, Ted Yu wrote: > Not much. > > So no chance of different snappy version ? > > On Fri, Apr 8, 2016 at 1:26 PM,

Total Task size exception in Spark 1.6.0 when writing a DataFrame

2016-01-17 Thread Night Wolf
Hi all, Doing some simple column transformations (e.g. trimming strings) on a DataFrame using UDFs. The DataFrame is in Avro format and is being loaded off HDFS. The job has about 16,000 parts/tasks. About halfway through, the job fails with a message: org.apache.spark.SparkException: Job

Re: Spark SQL - UDF for scoring a model - take $"*"

2015-09-08 Thread Night Wolf
an array of doubles with 3 fields: the prediction, the class A probability and the class B probability. How could I turn those into 3 columns from my expression? Clearly .withColumn only expects 1 column back. On Tue, Sep 8, 2015 at 6:21 PM, Night Wolf <nightwolf...@gmail.com> wrote: > Sorry for
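One way to get three columns out of a single scoring UDF in the 1.x DataFrame API is to return a case class, which becomes a struct column, and then select its nested fields. A minimal sketch, assuming a DataFrame df, hypothetical feature columns and a made-up Score type standing in for the real model output:

  import sqlContext.implicits._              // for the $"col" syntax
  import org.apache.spark.sql.functions.udf

  case class Score(prediction: Double, probA: Double, probB: Double)

  // Stand-in for myModel.score(...); the real model call goes here.
  val scoreUdf = udf { (f1: Double, f2: Double) => Score(f1, 0.3, 0.7) }

  val scored = df.withColumn("score", scoreUdf($"feature1", $"feature2"))

  // The struct's fields can then be pulled out as ordinary columns.
  val expanded = scored.select($"*", $"score.prediction", $"score.probA", $"score.probB")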

Re: Spark SQL - UDF for scoring a model - take $"*"

2015-09-08 Thread Night Wolf
ely a string where parameters are comma-separated... > > On Mon, Sep 7, 2015 at 8:35, Night Wolf <nightwolf...@gmail.com> wrote: >> Is it possible to have a UDF which takes a variable number of arguments? >> >> e.g. df.select(myUdf($"*")) fails with

Re: Spark SQL - UDF for scoring a model - take $"*"

2015-09-08 Thread Night Wolf
t 5:47 PM, Night Wolf <nightwolf...@gmail.com> wrote: > So basically I need something like > > df.withColumn("score", new Column(new Expression { > ... > > def eval(input: Row = null): EvaluatedType = myModel.score(input) > ... > > })) > > But

Re: Spark SQL - UDF for scoring a model - take $"*"

2015-09-08 Thread Night Wolf
ue or some struct... On Tue, Sep 8, 2015 at 5:33 PM, Night Wolf <nightwolf...@gmail.com> wrote: > Not sure how that would work. Really I want to tack on an extra column > onto the DF with a UDF that can take a Row object. > > On Tue, Sep 8, 2015 at 1:54 AM, Jörn Franke <jornfr

Spark SQL - UDF for scoring a model - take $"*"

2015-09-07 Thread Night Wolf
Is it possible to have a UDF which takes a variable number of arguments? e.g. df.select(myUdf($"*")) fails with org.apache.spark.sql.AnalysisException: unresolved operator 'Project [scalaUDF(*) AS scalaUDF(*)#26]; What I would like to do is pass in a generic data frame which can then be passed
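A common workaround, rather than a true varargs UDF, is to bundle every column into one struct with struct() and write the UDF against a Row. A sketch under that assumption (df and the scoring body are placeholders):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.functions.{col, struct, udf}

  // Stand-in scoring function; replace the body with the real myModel.score(r).
  val scoreRow = udf { (r: Row) => r.length.toDouble }

  // struct() packs all columns into a single argument the UDF receives as a Row.
  val scored = df.withColumn("score", scoreRow(struct(df.columns.map(col): _*)))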

SPARK_DIST_CLASSPATH, primordial class loader app ClassNotFound

2015-08-26 Thread Night Wolf
Hey all, I'm trying to do some stuff with a YAML file in the Spark driver using the SnakeYAML library in Scala. When I put the snakeyaml v1.14 jar on the SPARK_DIST_CLASSPATH and try to deserialize some objects from YAML into classes in my app JAR on the driver (only the driver), I get the

SparkSQL - understanding Cross Joins

2015-06-25 Thread Night Wolf
Hi guys, I'm trying to do a cross join (cartesian product) of 3 tables stored as Parquet. Each table has 1 column, a long key. Table A has 60,000 keys with 1000 partitions, Table B has 1000 keys with 1 partition, Table C has 4 keys with 1 partition. The output should be 240 million row
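For reference, in the 1.x DataFrame API a join with no join condition is a cartesian product, so the three-way cross can be written directly; the paths below are made up and the reader-style API assumes Spark 1.4:

  val a = sqlContext.read.parquet("/data/tableA")   // 60,000 keys, 1000 partitions
  val b = sqlContext.read.parquet("/data/tableB")   // 1,000 keys, 1 partition
  val c = sqlContext.read.parquet("/data/tableC")   // 4 keys, 1 partition

  // Expected size: 60,000 * 1,000 * 4 = 240,000,000 rows.
  val cross = a.join(b).join(c)
  println(cross.count())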

Re: Join highly skewed datasets

2015-06-15 Thread Night Wolf
How far did you get? On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: We use Scoobi + MR to perform joins and we particularly use blockJoin() API of scoobi /** Perform an equijoin with another distributed list where this list is considerably smaller * than the
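A Spark-side technique often used instead of Scoobi's blockJoin for skewed keys is key salting: scatter the skewed (big) side across r sub-keys and replicate the small side r times so no single reducer receives an entire hot key. A sketch, assuming Spark 1.3+ where the pair-RDD implicits are in scope automatically:

  import scala.reflect.ClassTag
  import scala.util.Random
  import org.apache.spark.rdd.RDD

  def saltedJoin[K: ClassTag, V: ClassTag, W: ClassTag](
      big: RDD[(K, V)], small: RDD[(K, W)], r: Int): RDD[(K, (V, W))] = {
    // Random salt on the big side, full replication on the small side.
    val saltedBig   = big.map { case (k, v) => ((k, Random.nextInt(r)), v) }
    val saltedSmall = small.flatMap { case (k, w) => (0 until r).iterator.map(i => ((k, i), w)) }
    saltedBig.join(saltedSmall).map { case ((k, _), vw) => (k, vw) }
  }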

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-15 Thread Night Wolf
: Running task 11093.0 in stage 0.0 (TID 9552) 15/06/16 13:43:22 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 9553 15/06/16 13:43:22 INFO executor.Executor: Running task 10323.1 in stage 0.0 (TID 9553) On Tue, Jun 16, 2015 at 1:47 PM, Night Wolf nightwolf...@gmail.com wrote: Hi guys,

Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-15 Thread Night Wolf
Hi guys, Using Spark 1.4, trying to save a DataFrame as a table, a really simple test, but I'm getting a bunch of NPEs. The code I'm running is very simple: qc.read.parquet("/user/sparkuser/data/staged/item_sales_basket_id.parquet").write.format("parquet").saveAsTable("is_20150617_test2") Logs of

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-15 Thread Night Wolf
tasks or regular tasks (the first attempt of the task)? Is this error deterministic (can you reproduce every time you run this command)? Thanks, Yin On Mon, Jun 15, 2015 at 8:59 PM, Night Wolf nightwolf...@gmail.com wrote: Looking at the logs of the executor, looks like it fails to find

Re: Spark 1.4 HiveContext fails to initialise with native libs

2015-06-04 Thread Night Wolf
? spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni https://issues.apache.org/jira/browse/SPARK-7819 has more context about it. On Wed, Jun 3, 2015 at 9:38 PM, Night Wolf nightwolf

Spark 1.4 HiveContext fails to initialise with native libs

2015-06-03 Thread Night Wolf
Hi all, Trying out Spark 1.4 RC4 on MapR4/Hadoop 2.5.1 running in yarn-client mode with Hive support. *Build command:* ./make-distribution.sh --name mapr4.0.2_yarn_j6_2.10 --tgz -Pyarn -Pmapr4 -Phadoop-2.4 -Pmapr4 -Phive -Phadoop-provided -Dhadoop.version=2.5.1-mapr-1501

Re: Spark 1.4 YARN Application Master fails with 500 connect refused

2015-06-02 Thread Night Wolf
: Opening proxy : qtausc-pphd0177.hadoop.local:40237 15/06/03 10:34:31 INFO impl.AMRMClientImpl: Received new token for : qtausc-pphd0132.hadoop.local:44108 15/06/03 10:34:31 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 0 of them. On Wed, Jun 3, 2015 at 10:29 AM, Night

Re: Spark 1.4 YARN Application Master fails with 500 connect refused

2015-06-02 Thread Night Wolf
1.3 and 1.4; it also has been working fine for me. Are you sure you're using exactly the same Hadoop libraries (since you're building with -Phadoop-provided) and Hadoop configuration in both cases? On Tue, Jun 2, 2015 at 5:29 PM, Night Wolf nightwolf...@gmail.com wrote: Hi all, Trying out

Spark 1.3.1 Performance Tuning/Patterns for Data Generation Heavy/Throughput Jobs

2015-05-19 Thread Night Wolf
Hi all, I have a job that, for every row, creates about 20 new objects (i.e. an RDD of 100 rows in = an RDD of 2,000 rows out). The reason for this is that each row is tagged with a list of the 'buckets' or 'windows' it belongs to. The actual data is about 10 billion rows. Each executor has 60GB of memory.
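The expansion itself is usually best expressed as a flatMap, which streams through each partition's iterator so only one input row's ~20 outputs are materialized at a time rather than a whole expanded partition. A sketch with made-up record and bucket-tagging types:

  case class Record(key: Long, value: Double)
  case class Tagged(key: Long, value: Double, bucket: Int)

  def bucketsFor(r: Record): Seq[Int] = 0 until 20   // stand-in for the real window logic

  // input is assumed to be an RDD[Record]; each row fans out to ~20 tagged rows lazily.
  val tagged = input.flatMap { r =>
    bucketsFor(r).map(b => Tagged(r.key, r.value, b))
  }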

Spark Sorted DataFrame Repartitioning

2015-05-13 Thread Night Wolf
Hi guys, If I load a DataFrame via a SQL context with a SORT BY in the query and I then repartition the DataFrame, will it keep the sort order in each partition? I want to repartition because I'm going to run a map that generates lots of data internally, so to avoid out-of-memory errors I
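For what it's worth, a plain repartition() reshuffles rows with no regard for the earlier SORT BY, so per-partition order is not preserved. One way to control both the partitioning key and the within-partition order in a single shuffle is to ask for them in the query itself (HiveQL syntax via a HiveContext; the table and column names are made up):

  // DISTRIBUTE BY chooses the shuffle key, SORT BY orders rows within each resulting
  // partition; the partition count comes from spark.sql.shuffle.partitions.
  val df = sqlContext.sql("SELECT key, value FROM my_table DISTRIBUTE BY key SORT BY key")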

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-12 Thread Night Wolf
I'm seeing a similar thing with a slightly different stack trace. Ideas? org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150) org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-05-12 Thread Night Wolf
Seeing similar issues, did you find a solution? One would be to increase the number of partitions if you're doing lots of object creation. On Thu, Feb 12, 2015 at 7:26 PM, fightf...@163.com fightf...@163.com wrote: Hi, patrick Really glad to get your reply. Yes, we are doing group by
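To make the 'more partitions' suggestion concrete, the parallelism can be raised explicitly at the shuffle rather than relying on defaults; the names and the number 2048 below are placeholders:

  // For an RDD shuffle:
  val grouped = pairs.groupByKey(numPartitions = 2048)

  // For DataFrame/SQL shuffles:
  sqlContext.setConf("spark.sql.shuffle.partitions", "2048")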

Re: Partition Case Class RDD without ParRDDFunctions

2015-05-07 Thread Night Wolf
? I was experimenting with the Row class in Python and apparently partitionBy automatically takes the first column as the key. However, I am not sure how you can access a part of an object without deserializing it (either explicitly or Spark doing it for you) On Wed, May 6, 2015 at 7:14 PM, Night Wolf

Partition Case Class RDD without ParRDDFunctions

2015-05-06 Thread Night Wolf
Hi, If I have an RDD[MyClass] and I want to partition it by the hash code of MyClass for performance reasons, is there any way to do this without converting it into a pair RDD (RDD[(K,V)]) and calling partitionBy??? Mapping it to a Tuple2 seems like a waste of space/computation. It looks like the
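The usual workaround in the RDD API is still to go through a pair RDD, but only transiently: key by the hash, partition, then drop the key. A sketch (128 partitions is an arbitrary choice):

  import org.apache.spark.HashPartitioner

  val partitioned = rdd
    .keyBy(_.hashCode)                       // RDD[(Int, MyClass)]
    .partitionBy(new HashPartitioner(128))
    .values                                  // back to RDD[MyClass]

The cost is one extra Tuple2 per record during the shuffle.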

Re: Spark SQL ThriftServer Impersonation Support

2015-05-03 Thread Night Wolf
Thanks Andrew. What version of HS2 is the SparkSQL Thrift server using? What would be involved in updating? Is it a simple case of bumping the dependency version in one of the project POMs? Cheers, ~N On Sat, May 2, 2015 at 11:38 AM, Andrew Lee alee...@hotmail.com wrote: Hi N, See:

Spark SQL ThriftServer Impersonation Support

2015-05-01 Thread Night Wolf
Hi guys, Trying to use the SparkSQL Thrift server with the Hive metastore. It seems that Hive metastore impersonation works fine (when running Hive tasks). However, when spinning up the SparkSQL Thrift server, impersonation doesn't seem to work... What settings do I need to enable impersonation? I've copied the

Spark SQL - Setting YARN Classpath for primordial class loader

2015-04-23 Thread Night Wolf
Hi guys, Having a problem building a DataFrame in Spark SQL from a JDBC data source when running with --master yarn-client and adding the JDBC driver JAR with --jars. If I run with a local[*] master, everything works fine. ./bin/spark-shell --jars /tmp/libs/mysql-jdbc.jar --master yarn-client

Re: Spark SQL - Setting YARN Classpath for primordial class loader

2015-04-23 Thread Night Wolf
cluster, into a common location. On Thu, Apr 23, 2015 at 6:38 PM, Night Wolf nightwolf...@gmail.com wrote: Hi guys, Having a problem building a DataFrame in Spark SQL from a JDBC data source when running with --master yarn-client and adding the JDBC driver JAR with --jars. If I run

Spark 1.3 build with hive support fails on JLine

2015-03-30 Thread Night Wolf
Hey, Trying to build Spark 1.3 with Scala 2.11, with YARN and Hive (with the Thrift server) support. Running: *mvn -e -DskipTests -Pscala-2.11 -Dscala-2.11 -Pyarn -Pmapr4 -Phive -Phive-thriftserver clean install* The build fails with: [INFO] Compiling 9 Scala sources to

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-23 Thread Night Wolf
Was a solution ever found for this? Trying to run some test cases with sbt test which use Spark SQL; with the Spark 1.3.0 release and Scala 2.11.6 I get this error. Setting fork := true in sbt seems to work, but it's a less than ideal workaround. On Tue, Mar 17, 2015 at 9:37 PM, Eric Charles

Re: Compile Spark with Maven Zinc Scala Plugin

2015-03-06 Thread Night Wolf
Tried with that. No luck. Same error on the sbt-interface jar. I can see Maven downloaded that jar into my .m2 cache. On Friday, March 6, 2015, 鹰 980548...@qq.com wrote: try it with mvn -DskipTests -Pscala-2.11 clean install package

Building Spark 1.3 for Scala 2.11 using Maven

2015-03-05 Thread Night Wolf
Hey guys, Trying to build Spark 1.3 for Scala 2.11. I'm running with the following Maven command: -DskipTests -Dscala-2.11 clean install package *Exception*: [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve dependencies for project

Compile Spark with Maven Zinc Scala Plugin

2015-03-05 Thread Night Wolf
Hey, Trying to build the latest Spark 1.3 with Maven using -DskipTests clean install package. But I'm getting errors with zinc; in the logs I see: [INFO] *--- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-network-common_2.11 --- * ... [error] Required file not found:

Re: Columnar-Oriented RDDs

2015-03-01 Thread Night Wolf
wrote: Shark's in-memory code was ported to Spark SQL and is used by default when you run .cache on a SchemaRDD or CACHE TABLE. I'd also look at Parquet which is more efficient and handles nested data better. On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf nightwolf...@gmail.com

Columnar-Oriented RDDs

2015-02-13 Thread Night Wolf
Hi all, I'd like to build/use column-oriented RDDs in some of my Spark code. A normal Spark RDD is stored as row-oriented objects, if I understand correctly. I'd like to leverage some of the advantages of a columnar memory format. Shark (used to) and SparkSQL use a columnar storage format using
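For context, the two off-the-shelf columnar options of that era were the in-memory columnar cache for SchemaRDDs/tables and Parquet on disk. A Spark 1.2-style sketch with a made-up path and table name:

  val events = sqlContext.parquetFile("/data/events")   // SchemaRDD
  events.registerTempTable("events")
  sqlContext.cacheTable("events")                       // held in the in-memory columnar format
  events.saveAsParquetFile("/data/events_out")          // columnar on disk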

Spark Master Build Failing to run on cluster in standalone ClassNotFoundException: javax.servlet.FilterRegistration

2015-02-03 Thread Night Wolf
Hi, I just built Spark 1.3 master using Maven via make-distribution.sh: ./make-distribution.sh --name mapr3 --skip-java-test --tgz -Pmapr3 -Phive -Phive-thriftserver -Phive-0.12.0 When trying to start the standalone Spark master on a cluster, I get the following stack trace: 15/02/04 08:53:56

Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Night Wolf
In Spark SQL we have Row objects which contain a list of fields that make up a row. A Row has ordinal accessors such as .getInt(0) or .getString(2). Say ordinal 0 = ID and ordinal 1 = Name. It becomes hard to remember which ordinal is which, making the code confusing. Say for example I have the
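One low-tech way to avoid sprinkling magic ordinals around is to name them once and reference the names; the (ID, Name) schema below is the example from the post and schemaRdd is a placeholder:

  object PersonCols {
    val Id = 0
    val Name = 1
  }

  val pairs = schemaRdd.map { row =>
    (row.getInt(PersonCols.Id), row.getString(PersonCols.Name))
  }

Later Spark releases also added name-based accessors on Row, which sidestep the problem entirely.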

Fast HashSets HashMaps - Spark Collection Utils

2015-01-14 Thread Night Wolf
Hi all, I'd like to leverage some of the fast Spark collection implementations in my own code. Particularly for doing things like distinct counts in a mapPartitions loop. Are there any plans to make the org.apache.spark.util.collection implementations public? Is there any other library out
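Since org.apache.spark.util.collection is private[spark], a plain mutable set per partition is the simple stand-in for this kind of distinct count; records below is a placeholder RDD[String]:

  import scala.collection.mutable

  val distinctPerPartition = records.mapPartitions { iter =>
    val seen = mutable.HashSet[String]()
    iter.foreach(seen += _)
    Iterator.single(seen.size)
  }

RDD.countApproxDistinct is another option when an approximate global count is acceptable.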

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Night Wolf
it from a source under test, as IntelliJ won't provide the provided-scope libraries when running code in the main source (but it will for sources under test). With this config you can run sbt assembly to get the fat jar without the Spark jars. On Tue, Jan 13, 2015 at 12:16 PM, Night Wolf

Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Night Wolf
Hi, I'm trying to load up an SBT project in IntelliJ 14 (Windows) running JDK 1.7 and SBT 0.13.5 - I seem to be getting errors with the project. The build.sbt file is super simple: name := "scala-spark-test1" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %%
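A build.sbt along the lines discussed in the reply above, with Spark marked provided so sbt assembly leaves it out of the fat jar (the 1.2.0 version comes from the subject line; the blank lines between settings are required by SBT 0.13.5):

  name := "scala-spark-test1"

  version := "1.0"

  scalaVersion := "2.10.4"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"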

SparkSQL and Hive/Hive metastore testing - LocalHiveContext

2014-11-18 Thread Night Wolf
Hi, Just to give some context. We are using the Hive metastore with CSV/Parquet files as part of our ETL pipeline. We query these with SparkSQL to do some downstream work. I'm curious what's the best way to go about testing Hive with SparkSQL? I'm using 1.1.0. I see that the LocalHiveContext has been
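One approach that does not rely on LocalHiveContext is a throwaway HiveContext pointed at an embedded Derby metastore and a temp warehouse directory, so test runs don't collide. A sketch, with arbitrary master, app name and directory names:

  import java.nio.file.Files
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val warehouseDir = Files.createTempDirectory("hive-warehouse").toString
  val metastoreDir = Files.createTempDirectory("hive-metastore").toString

  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("hive-sql-test"))
  val hive = new HiveContext(sc)
  hive.setConf("hive.metastore.warehouse.dir", warehouseDir)
  hive.setConf("javax.jdo.option.ConnectionURL",
    s"jdbc:derby:;databaseName=$metastoreDir/metastore_db;create=true")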