Re: dataset aggregators with kryo encoder very slow

2017-01-20 Thread Koert Kuipers
found it :) SPARK-1890 thanks cloud-fan. On Sat, Jan 21, 2017 at 1:46 AM, Koert Kuipers wrote: > trying to replicate this in spark itself, i can for v2.1.0 but not for > master. i guess it has been fixed > > On Fri, Jan 20, 2017 at 4:57 PM, Koert Kuipers

Re: dataset aggregators with kryo encoder very slow

2017-01-20 Thread Koert Kuipers
trying to replicate this in spark itself, i can for v2.1.0 but not for master. i guess it has been fixed. On Fri, Jan 20, 2017 at 4:57 PM, Koert Kuipers wrote: > i started printing out when kryo serializes my buffer data structure for > my aggregator. > > i would expect every

Re: Ingesting data in parallel across workers in Data Frame

2017-01-20 Thread Peyman Mohajerian
The next section in the same document has a solution. On Fri, Jan 20, 2017 at 9:03 PM, Abhishek Gupta wrote: > I am trying to load data from the database into a DataFrame using a JDBC > driver. I want to get data into partitions; the following document has a > nice

Ingesting data in parallel across workers in Data Frame

2017-01-20 Thread Abhishek Gupta
I am trying to load data from the database into a DataFrame using a JDBC driver. I want to get the data into partitions; the following document has a nice explanation of how to achieve this. https://docs.databricks.com/spark/latest/data-sources/sql-databases.html
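
For context, a minimal sketch of the partitioned JDBC read that the linked document describes. The URL, table, credentials, and bounds below are hypothetical placeholders; partitionColumn must be a numeric column.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitioned-read").getOrCreate()

// Spark splits [lowerBound, upperBound] on partitionColumn into numPartitions
// ranges and issues one query per partition, so the reads run in parallel.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb") // hypothetical
  .option("dbtable", "my_table")
  .option("user", "user")
  .option("password", "pass")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()
```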

Kryo (with Spark 1.6.3) class registration slows down processing

2017-01-20 Thread N B
Hello, Here is something I am unable to explain, and it goes against Kryo's documentation, numerous suggestions on the web and on this list, as well as pure intuition. Our Spark application runs in a single JVM (perhaps this is relevant, hence mentioning it). We have been using Kryo serialization
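
For reference, a sketch of the class-registration setup usually recommended for Kryo; the MyEvent type is a hypothetical stand-in for the application's own classes.

```scala
import org.apache.spark.SparkConf

// Hypothetical application type standing in for whatever is being serialized.
case class MyEvent(id: Long, payload: String)

val conf = new SparkConf()
  .setAppName("kryo-registration")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // fail fast on any class that was not registered explicitly
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[MyEvent], classOf[Array[MyEvent]]))
```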

Force mesos to provide GPUs to Spark

2017-01-20 Thread Ji Yan
Dear Spark Users, With the latest version of Spark and Mesos with GPU support, is there a way to guarantee a Spark job a specified number of GPUs? Currently the Spark job sets "spark.mesos.gpus.max" to ask for GPU resources; however, this is an upper bound, which means that Spark will accept
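
For reference, the setting under discussion, as a sketch; per the thread it acts as a cap on accepted GPU resources, not a guaranteed minimum, and the count is hypothetical.

```scala
import org.apache.spark.SparkConf

// Upper bound on GPUs Spark will accept from Mesos offers, not a reservation.
val conf = new SparkConf().set("spark.mesos.gpus.max", "4")
```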

Re: Dataframe caching

2017-01-20 Thread रविशंकर नायर
Thanks, Will look into this. Best regards, Ravion -- Forwarded message -- From: "Muthu Jayakumar" Date: Jan 20, 2017 10:56 AM Subject: Re: Dataframe caching To: "☼ R Nair (रविशंकर नायर)" Cc: "user@spark.apache.org"

Re: java.io.InvalidClassException: org.apache.spark.executor.TaskMetrics

2017-01-20 Thread kant kodali
nvm, figured it out. I compiled my client jar with 2.0.2 while the spark that is deployed on my machines was 2.0.1. communication problems between dev team and ops team :) On Fri, Jan 20, 2017 at 3:03 PM, kant kodali wrote: > Is this because of a versioning issue? can't wait for JDK
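
A build-side guard against this class of mismatch, as a sketch assuming an sbt build: pin the client to the cluster's Spark version and mark it "provided" so the assembly never ships its own, mismatched copy.

```scala
// build.sbt -- sketch; the version string must match what ops actually deployed.
val sparkVersion = "2.0.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)
```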

Re: java.io.InvalidClassException: org.apache.spark.executor.TaskMetrics

2017-01-20 Thread kant kodali
Is this because of a versioning issue? Can't wait for the JDK 9 module system. I am not sure if spark plans to leverage it? On Fri, Jan 20, 2017 at 1:30 PM, kant kodali wrote: > I get the following exception. I am using Spark 2.0.1 and Scala 2.11.8. > >

Re: dataset aggregators with kryo encoder very slow

2017-01-20 Thread Koert Kuipers
i started printing out when kryo serializes my buffer data structure for my aggregator. i would expect every buffer object to ideally get serialized only once: at the end of the map-side before the shuffle (so after all the values for the given key within the partition have been reduced into it).
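
For context, a minimal sketch of the kind of kryo-encoded aggregator under discussion; the Buffer type and the sum logic are hypothetical stand-ins for the custom data structure in the thread.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical buffer standing in for the custom data structure.
case class Buffer(var values: List[Int])

object SumAgg extends Aggregator[Int, Buffer, Int] {
  def zero: Buffer = Buffer(Nil)
  def reduce(b: Buffer, a: Int): Buffer = { b.values = a :: b.values; b }
  def merge(b1: Buffer, b2: Buffer): Buffer = { b1.values = b1.values ++ b2.values; b1 }
  def finish(b: Buffer): Int = b.values.sum
  // the kryo-encoded buffer is where the unexpected per-record serialization shows up
  def bufferEncoder: Encoder[Buffer] = Encoders.kryo[Buffer]
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

// usage: ds.groupByKey(identity).agg(SumAgg.toColumn)
```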

Re:

2017-01-20 Thread Keith Chapman
Hi Jacek, I've looked at SparkListener and tried it; I see it getting fired on the master, but I don't see it getting fired on the workers in a cluster. Regards, Keith. http://keith-chapman.com On Fri, Jan 20, 2017 at 11:09 AM, Jacek Laskowski wrote: > Hi, > > (redirecting to

java.io.InvalidClassException: org.apache.spark.executor.TaskMetrics

2017-01-20 Thread kant kodali
I get the following exception. I am using Spark 2.0.1 and Scala 2.11.8. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 13, 172.31.20.212): java.io.InvalidClassException:

Re: New runtime exception after switch to Spark 2.1.0

2017-01-20 Thread Jacek Laskowski
Thanks for sharing! A very interesting read indeed. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Jan 20, 2017 at 10:17 PM, Morten Hornbech

is this something to worry about? HADOOP_HOME or hadoop.home.dir are not set

2017-01-20 Thread kant kodali
Hi, I am running spark standalone with no storage. When I use spark-submit to submit my job I get the following exception, and I wonder if this is something to worry about? java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set
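
On a storage-less standalone cluster this warning is generally benign. If it needs silencing, one option is pointing the hadoop.home.dir system property at an unpacked Hadoop distribution; the path below is a hypothetical example.

```scala
// Sketch: must run before the first Hadoop filesystem call in the JVM.
// On Windows the directory would need to contain bin\winutils.exe.
System.setProperty("hadoop.home.dir", "/opt/hadoop")
```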

Re: New runtime exception after switch to Spark 2.1.0

2017-01-20 Thread Morten Hornbech
Sure :-) Digging into the InvocationTargetException revealed a “NoSuchFieldError: DEFAULT_MAX_PENDING_TASKS”, which we guessed was linked to some kind of binary incompatibility in the dependencies. Looking into the stack trace, this could be traced to a dynamic constructor call in netty, and we
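
If the build is sbt-based, a sketch of one possible fix is forcing a single netty version so an older transitive copy cannot shadow the one Spark 2.1.0 ships; 4.0.42 is the version named later in this thread, but verify against your own dependency tree.

```scala
// build.sbt -- sketch: pin netty to the version Spark 2.1.0 expects.
dependencyOverrides += "io.netty" % "netty-all" % "4.0.42.Final"
```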

Fwd: Spark streaming app that processes Kafka DStreams produces no output and no error

2017-01-20 Thread Shixiong(Ryan) Zhu
-- Forwarded message -- From: Shixiong(Ryan) Zhu Date: Fri, Jan 20, 2017 at 12:06 PM Subject: Re: Spark streaming app that processes Kafka DStreams produces no output and no error To: shyla deshpande That's how KafkaConsumer

Re: spark 2.02 error when writing to s3

2017-01-20 Thread Neil Jonkers
Can you test by enabling EMRFS consistent view and using an s3:// URI? http://docs.aws.amazon.com/emr/latest/ManagementGuide/enable-consistent-view.html Original message From: Steve Loughran Date: 20/01/2017 21:17 (GMT+02:00) To: "VND Tremblay, Paul"

Re: New runtime exception after switch to Spark 2.1.0

2017-01-20 Thread Jacek Laskowski
Hi, I'd be very interested in how you figured it out. Mind sharing? Jacek On 18 Jan 2017 9:51 p.m., "mhornbech" wrote: > For anyone revisiting this at a later point, the issue was that Spark 2.1.0 > upgrades netty to version 4.0.42 which is not binary compatible with >

Re: spark 2.02 error when writing to s3

2017-01-20 Thread Steve Loughran
AWS S3 is eventually consistent: even after something is deleted, a LIST/GET call may show it. You may be seeing that effect; even after the DELETE has got rid of the files, a listing sees something there. And I suspect the time it takes for the listing to "go away" will depend on the total

Re:

2017-01-20 Thread Jacek Laskowski
Hi, (redirecting to users as it has nothing to do with Spark project development) Monitor jobs and stages using SparkListener and submit cleanup jobs where a condition holds. Jacek On 20 Jan 2017 3:57 a.m., "Keith Chapman" wrote: > Hi , > > Is it possible for an
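
A minimal sketch of the approach Jacek describes; the cleanup condition is hypothetical. Note that listeners run on the driver, which is consistent with Keith's observation earlier in this thread that they do not fire on the workers.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}

class CleanupListener extends SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println(s"stage ${stage.stageInfo.stageId} completed")

  override def onJobEnd(job: SparkListenerJobEnd): Unit = {
    // submit a cleanup job here when the condition of interest holds
    println(s"job ${job.jobId} ended")
  }
}

// registration (driver side):
// sc.addSparkListener(new CleanupListener)
// or set the spark.extraListeners configuration property
```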

RE: spark 2.02 error when writing to s3

2017-01-20 Thread VND Tremblay, Paul
I am using an EMR cluster, and the latest version offered is 2.02. The link below indicates that that user had the same problem, which seems unresolved. Thanks Paul _ Paul Tremblay Analytics

FunctionRegistry

2017-01-20 Thread Bowden, Chris
Thoughts on exposing FunctionRegistry via ExperimentalMethods? I have functionality which cannot be expressed efficiently via UDFs, consequently I implement my own Expressions. Currently I have to lift access to FunctionRegistry in my project(s) within org.apache.spark.sql.*. I also have to

too noisy

2017-01-20 Thread Alvin Chen
hi, this mailing list is too noisy. is there another one i can sign up for that only includes releases and announcements? thanks, Alvin

Re: Differing triplet and vertex data

2017-01-20 Thread lbollar
Sorry, wrong link above. http://apache-spark-user-list.1001560.n3.nabble.com/Differing-triplet-and-vertex-data-td28330.html

Differing triplet and vertex data

2017-01-20 Thread lbollar
Hello all, Found this previous post, which appears related, but didn't see an answer. http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=search_page=1=triplet+vertex I have a graph algorithm in spark 1.6.2 where I am implementing Louvain Modularity. The implementation

Re: Dataframe caching

2017-01-20 Thread Muthu Jayakumar
I guess this may help in your case? https://spark.apache.org/docs/latest/sql-programming-guide.html#global-temporary-view Thanks, Muthu On Fri, Jan 20, 2017 at 6:27 AM, ☼ R Nair (रविशंकर नायर) < ravishankar.n...@gmail.com> wrote: > Dear all, > > Here is a requirement I am thinking of
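
A minimal sketch of the global temporary view suggestion (requires Spark 2.1+); df and the view name are hypothetical. The view is tied to the application rather than to one session, and is resolved under the global_temp database.

```scala
df.createGlobalTempView("people")
spark.sql("SELECT count(*) FROM global_temp.people").show()
// visible from other sessions in the same application:
spark.newSession().sql("SELECT count(*) FROM global_temp.people").show()
```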

Dataframe caching

2017-01-20 Thread रविशंकर नायर
Dear all, Here is a requirement I am thinking of implementing in Spark core. Please let me know if this is possible, and kindly provide your thoughts. A user executes a query to fetch 1 million records from, let's say, a database. We let the user store this as a dataframe, partitioned across

Running Hive Beeline .hql file in Spark

2017-01-20 Thread Ravi Prasad
Hi, Currently we are running Hive Beeline queries as below. Beeline: beeline -u "jdbc:hive2://localhost:1/default;principal=hive/_h...@nsroot.net" --showHeader=false --silent=true --outputformat=dsv --verbose=false -f /home/sample.hql > output_partition.txt Note: We run the
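
One way to run the same file through Spark, as a sketch: read the .hql and execute each statement via a Hive-enabled SparkSession. This assumes the file holds plain SQL statements separated by ';' with no beeline-only commands.

```scala
import scala.io.Source
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("run-hql")
  .enableHiveSupport() // needed so Hive tables resolve
  .getOrCreate()

// Split the script into statements and run them in order.
Source.fromFile("/home/sample.hql").mkString
  .split(";")
  .map(_.trim)
  .filter(_.nonEmpty)
  .foreach(stmt => spark.sql(stmt))
```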

Re: Saving from Dataset to Bigquery Table

2017-01-20 Thread Jörn Franke
Sure, but PairRDD is part of the Spark libraries and should work in the cloud. > On 20 Jan 2017, at 13:24, A Shaikh wrote: > > Thanks for responding Jorn. Currently I upload the jar to Google Cloud and > run my job not ideal for development. Do you know if we can run

Re: Saving from Dataset to Bigquery Table

2017-01-20 Thread A Shaikh
Thanks for responding, Jorn. Currently I upload the jar to Google Cloud and run my job, which is not ideal for development. Do you know if we can run this from within our local machine, given that all the required jars are downloaded by SBT anyway? On 20 January 2017 at 11:22, Jörn Franke

help,I want to call spark-submit from java shell

2017-01-20 Thread lk_spark
hi, all: under spark 2.0 with hadoop 2.7.2, my code is like this: String c1 = "/bin/sh"; String c2 = "-c"; StringBuilder sb = new StringBuilder("cd /home/hadoop/dmp/spark-2.0.2-bin-hadoop2.7/bin;spark-submit --class com.hua.spark.dataload.DataLoadFromBase64JSON --master yarn
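
An alternative sketch using org.apache.spark.launcher.SparkLauncher, the supported way to start spark-submit from another JVM, instead of assembling a /bin/sh command line by hand. The jar path here is a hypothetical placeholder.

```scala
import org.apache.spark.launcher.SparkLauncher

// Launches spark-submit as a child process and returns a handle for
// monitoring the application's state.
val handle = new SparkLauncher()
  .setSparkHome("/home/hadoop/dmp/spark-2.0.2-bin-hadoop2.7")
  .setAppResource("/home/hadoop/dmp/dataload.jar") // hypothetical jar path
  .setMainClass("com.hua.spark.dataload.DataLoadFromBase64JSON")
  .setMaster("yarn")
  .startApplication()
```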

Re: "Unable to load native-hadoop library for your platform" while running Spark jobs

2017-01-20 Thread Steve Loughran
On 19 Jan 2017, at 10:59, Sean Owen wrote: It's a message from Hadoop libs, not Spark. It can be safely ignored. It's just saying you haven't installed the additional (non-Apache-licensed) native libs that can accelerate some operations. This is

Re: Anyone has any experience using spark in the banking industry?

2017-01-20 Thread Steve Loughran
> On 18 Jan 2017, at 21:50, kant kodali wrote: > > Anyone has any experience using spark in the banking industry? I have a couple > of questions. > 2. How can I make the spark cluster highly available across multiple datacenters? Any > pointers? That's not, AFAIK, been a design

Re: Saving from Dataset to Bigquery Table

2017-01-20 Thread Jörn Franke
It is only on PairRDD. > On 20 Jan 2017, at 11:54, A Shaikh wrote: > > Has anyone experience saving a Dataset to a BigQuery table? > > I am loading into BigQuery using the following example successfully. This uses > the RDD.saveAsNewAPIHadoopDataset method to save data. > I am

Re: Non-linear (curved?) regression line

2017-01-20 Thread Sean Owen
I don't think this is a Spark question. This isn't a problem you solve by throwing all combinations of options at it. Your target is not a linear function of input, or its square, and it's not a question of GLM link function. You may need to look at the log-log plot because this looks like a

Saving from Dataset to Bigquery Table

2017-01-20 Thread A Shaikh
Has anyone experience saving a Dataset to a BigQuery table? I am loading into BigQuery using the following example successfully. This uses the RDD.saveAsNewAPIHadoopDataset method to save data. I am using Dataset (or DataFrame) and
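
A sketch of the workaround implied by Jörn's reply: saveAsNewAPIHadoopDataset lives on pair RDDs, so convert the Dataset to (key, value) pairs first. Here `ds` and `hadoopConf` (carrying the BigQuery output table and output-format settings from the connector example) are assumed to exist.

```scala
import com.google.gson.JsonParser
import org.apache.hadoop.io.NullWritable

// Turn each row into a (null key, JsonObject) pair, matching the shape the
// BigQuery connector examples write.
val pairs = ds.toJSON.rdd.map { json =>
  (NullWritable.get(), new JsonParser().parse(json).getAsJsonObject)
}
pairs.saveAsNewAPIHadoopDataset(hadoopConf)
```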

Re: physical memory usage keep increasing for spark app on Yarn

2017-01-20 Thread Pavel Plotnikov
Hi Yang, I have faced the same problem on Mesos, and to circumvent this issue I usually increase the partition number. On the last step in your code you reduce the number of partitions to 1; try to set a bigger value, maybe that solves this problem. Cheers, Pavel On Fri, Jan 20, 2017 at 12:35 PM Yang Cao
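
As a sketch of that suggestion, where `result`, the partition count, and the output path are hypothetical:

```scala
// before: result.coalesce(1).write.parquet("/output/path")  // one huge partition
result.repartition(200)               // tune to data volume and available cores
  .write.parquet("/output/path")
```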

physical memory usage keep increasing for spark app on Yarn

2017-01-20 Thread Yang Cao
Hi all, I am running a spark application in YARN-client mode with 6 executors (4 cores each, executor memory = 6G, overhead = 4G, spark version: 1.6.3 / 2.1.0). I find that my executor memory keeps increasing until it gets killed by the node manager, with a message telling me to boost

Re: TDD in Spark

2017-01-20 Thread A Shaikh
Thanks for all the suggestions. Very helpful. On 17 January 2017 at 22:04, Lars Albertsson wrote: > My advice, short version: > * Start by testing one job per test. > * Use Scalatest or a standard framework. > * Generate input datasets with Spark routines, write to local file.
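
For context, a minimal sketch of the "one job per test" advice using ScalaTest and a local SparkSession; the word-count logic is a hypothetical job under test.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class WordCountSpec extends FunSuite {
  // Local session so the test needs no cluster.
  private val spark = SparkSession.builder()
    .master("local[2]")
    .appName("wordcount-test")
    .getOrCreate()
  import spark.implicits._

  test("counts words in a small generated dataset") {
    val counts = Seq("a b", "b c").toDS()
      .flatMap(_.split(" "))
      .groupByKey(identity)
      .count()
      .collect()
      .toMap
    assert(counts == Map("a" -> 1L, "b" -> 2L, "c" -> 1L))
  }
}
```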