dataset algos slow because of too many shuffles

2017-02-02 Thread Koert Kuipers
we noticed that some algos we ported from rdd to dataset are significantly slower, and the main reason seems to be more shuffles that we successfully avoid for rdds by careful partitioning. this seems to be dataset specific as it works ok for dataframe. see also here:
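
A minimal sketch of the pattern Koert describes, assuming a pair RDD co-partitioned with a HashPartitioner versus the equivalent Dataset join (names and data are illustrative):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-demo").getOrCreate()
import spark.implicits._

// RDD side: giving both sides the same partitioner lets the join reuse the
// existing partitioning, so no extra shuffle beyond the initial partitionBy.
val left  = spark.sparkContext.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(8))
val right = spark.sparkContext.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(new HashPartitioner(8))
val joinedRdd = left.join(right)

// Dataset side: the optimizer does not know the data is already partitioned
// by key, so the physical plan typically inserts Exchange (shuffle) nodes.
val leftDs  = Seq((1, "a"), (2, "b")).toDS()
val rightDs = Seq((1, "x"), (2, "y")).toDS()
val joinedDs = leftDs.joinWith(rightDs, leftDs("_1") === rightDs("_1"))
joinedDs.explain()  // look for Exchange hashpartitioning(...) in the plan
```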

persistence iops and throughput check? Re: Running a spark code on multiple machines using google cloud platform

2017-02-02 Thread Heji Kim
Dear Anahita, When we run performance tests for Spark/YARN clusters on GCP, we have to make sure we are within iops and throughput limits. Depending on disk type (standard or SSD) and size of disk, you will only get so many max sustained iops and throughput per sec. The GCP instance metrics

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread ji yan
got it, thanks for clarifying! On Thu, Feb 2, 2017 at 2:57 PM, Michael Gummelt wrote: > Yes, that's expected. spark.executor.cores sizes a single executor. It > doesn't limit the number of executors. For that, you need spark.cores.max > (--total-executor-cores). > >

Re: eager? in dataframe's checkpoint

2017-02-02 Thread Jean Georges Perrin
i wrote this piece based on all that, hopefully it will help: http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/ > On Jan 31, 2017, at 4:18 PM, Burak Yavuz wrote: > > Hi Koert, > > When
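
A short sketch of the checkpoint call discussed in that post, assuming Spark 2.1+, where Dataset.checkpoint takes an eager flag (the checkpoint directory is an illustrative path):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // illustrative path

val df = spark.range(0, 1000000).toDF("id")

// eager = true (the default) materializes the checkpoint immediately;
// eager = false only marks the plan and checkpoints on the first action.
val checkpointed = df.checkpoint(eager = true)

checkpointed.count()  // subsequent actions read from the truncated lineage
```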

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread Michael Gummelt
Yes, that's expected. spark.executor.cores sizes a single executor. It doesn't limit the number of executors. For that, you need spark.cores.max (--total-executor-cores). And rdd.parallelize does not specify the number of executors. It specifies the number of partitions, which relates to the
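
A hedged illustration of the two settings Michael distinguishes, expressed as builder config (the values are arbitrary examples):

```scala
import org.apache.spark.sql.SparkSession

// spark.executor.cores sizes each individual executor;
// spark.cores.max (--total-executor-cores) caps the total cores across all
// executors, which indirectly bounds how many executors get launched.
val spark = SparkSession.builder()
  .appName("resource-sizing-demo")
  .config("spark.executor.cores", "4")   // 4 cores per executor
  .config("spark.cores.max", "16")       // at most 16 cores in total -> ~4 executors
  .getOrCreate()

// parallelize's second argument sets the number of partitions of the RDD,
// not the number of executors.
val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 32)
```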

Re: HBase Spark

2017-02-02 Thread Benjamin Kim
Hi Asher, I modified the pom to be the same Spark (1.6.0), HBase (1.2.0), and Java (1.8) version as our installation. The Scala (2.10.5) version is already the same as ours. But I’m still getting the same error. Can you think of anything else? Cheers, Ben > On Feb 2, 2017, at 11:06 AM, Asher

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread Ji Yan
I tried setting spark.executor.cores per executor, but Spark seems to be spinning up as many executors as possible up to spark.cores.max or however many CPU cores are available on the cluster, and this may be undesirable because the number of executors in rdd.parallelize(collection, # of partitions)

4 days left to submit your abstract to Spark Summit SF

2017-02-02 Thread Scott walent
We are just 4 days away from closing the CFP for Spark Summit 2017. We have expanded the tracks in SF to include sessions that focus on AI, Machine Learning and a 60 min deep dive track with technical demos. Submit your presentation today and join us for the 10th Spark Summit! Hurry, the CFP

Spark: Scala Shell Very Slow (Unresponsive)

2017-02-02 Thread jimitkr
Friends, After I launch spark-shell, the default Scala shell appears but is unresponsive. When I type any command on the shell, nothing appears on my screen; the shell is completely unresponsive. My server has 32 GB of memory and approx. 18 GB is free after launching spark-shell, so it may

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread Michael Gummelt
As of Spark 2.0, Mesos mode does support setting cores on the executor level, but you might need to set the property directly (--conf spark.executor.cores=). I've written about this here: https://docs.mesosphere.com/1.8/usage/service-guides/spark/job-scheduling/. That doc is for DC/OS, but the

Re: Spark 2 + Java + UDF + unknown return type...

2017-02-02 Thread Jörn Franke
Not sure what your UDF is exactly doing, but why not one UDF per type? You avoid issues converting it, and it is more obvious for the user of your UDF, etc. You could of course return a complex type with one long, one string and one double and you fill them in the UDF as needed, but this would be
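
A sketch of the complex-return-type approach Jörn suggests, written in Scala for brevity (the case class and field names are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Hypothetical wrapper type: one long, one string, one double, filled as needed.
case class MultiResult(asLong: Option[Long], asString: Option[String], asDouble: Option[Double])

val spark = SparkSession.builder().appName("udf-demo").getOrCreate()
import spark.implicits._

// The UDF always declares MultiResult as its return type, so registration needs
// no per-call type juggling; callers pick the field they care about.
val parseValue = udf { (raw: String) =>
  if (raw.nonEmpty && raw.forall(_.isDigit)) MultiResult(Some(raw.toLong), None, None)
  else if (raw.matches("""-?\d+\.\d+""")) MultiResult(None, None, Some(raw.toDouble))
  else MultiResult(None, Some(raw), None)
}

val df = Seq("42", "3.14", "hello").toDF("raw")
df.select($"raw", parseValue($"raw").as("parsed"))
  .select($"raw", $"parsed.asLong", $"parsed.asString", $"parsed.asDouble")
  .show()
```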

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread Ji Yan
I was mainly confused why this is the case with memory, but with CPU cores it is not specified at the per-executor level. On Thu, Feb 2, 2017 at 1:02 PM, Michael Gummelt wrote: > It sounds like you've answered your own question, right? > --executor-memory means the memory

Spark 2 + Java + UDF + unknown return type...

2017-02-02 Thread Jean Georges Perrin
Hi fellow Sparkans, I am building a UDF (in Java) that can return various data types, basically the signature of the function itself is: public Object call(String a, Object b, String c, Object d, String e) throws Exception When I register my function, I need to provide a type, e.g.:

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread Michael Gummelt
It sounds like you've answered your own question, right? --executor-memory means the memory per executor. If you have no executor w/ 200GB memory, then the driver will accept no offers. On Thu, Feb 2, 2017 at 1:01 PM, Ji Yan wrote: > sorry, to clarify, i was using

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread Ji Yan
sorry, to clarify, i was using --executor-memory for memory, and --total-executor-cores for cpu cores On Thu, Feb 2, 2017 at 12:56 PM, Michael Gummelt wrote: > What CLI args are your referring to? I'm aware of spark-submit's > arguments (--executor-memory,

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread Michael Gummelt
What CLI args are you referring to? I'm aware of spark-submit's arguments (--executor-memory, --total-executor-cores, and --executor-cores) On Thu, Feb 2, 2017 at 12:41 PM, Ji Yan wrote: > I have done an experiment on this today. It shows that only CPUs are > tolerant of

Spark 2 - Creating datasets from dataframes with extra columns

2017-02-02 Thread Don Drake
In 1.6, when you created a Dataset from a Dataframe that had extra columns, the columns not in the case class were dropped from the Dataset. For example in 1.6, the column c4 is gone: scala> case class F(f1: String, f2: String, f3:String) defined class F scala> import sqlContext.implicits._
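
A small sketch of the behavior change Don describes, and one way to get the old column-pruning behavior back (assuming Spark 2.x; the case class mirrors his example):

```scala
import org.apache.spark.sql.SparkSession

case class F(f1: String, f2: String, f3: String)

val spark = SparkSession.builder().appName("as-demo").getOrCreate()
import spark.implicits._

val df = Seq(("a", "b", "c", "d")).toDF("f1", "f2", "f3", "c4")

// In Spark 2.x, .as[F] keeps the extra column c4 in the underlying schema.
val ds = df.as[F]
ds.printSchema()   // still shows f1, f2, f3, c4

// To drop the extra columns explicitly, select the case-class fields first,
// or map through the typed API, which re-encodes to F.
val pruned = df.select("f1", "f2", "f3").as[F]
pruned.printSchema()  // f1, f2, f3 only

val mapped = df.as[F].map(identity)
```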

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread Ji Yan
I have done an experiment on this today. It shows that only CPUs are tolerant of insufficient cluster size when a job starts. On my cluster, I have 180 GB of memory and 64 cores; when I run spark-submit (on mesos) with --cpu_cores set to 1000, the job starts up with 64 cores. But when I set

Re: pivot over non numerical data

2017-02-02 Thread Darshan Pandya
Thanks Kevin, Worked like a charm. FYI for readers, val temp1 = temp.groupBy("reference_id").pivot("char_name").agg(max($"char_value")) I didn't know I can use 'agg' with a string max. I was using it incorrectly as below temp.groupBy("reference_id").pivot("char_name").max("char_value") On Wed,
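
A self-contained version of the pattern above (column names follow the thread; the data is made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().appName("pivot-demo").getOrCreate()
import spark.implicits._

val temp = Seq(
  ("r1", "color", "red"),
  ("r1", "size",  "XL"),
  ("r2", "color", "blue")
).toDF("reference_id", "char_name", "char_value")

// agg(max($"char_value")) works because the max aggregate is defined for
// strings (lexicographic order); the numeric shortcut .max("char_value")
// only accepts numeric columns.
val pivoted = temp.groupBy("reference_id").pivot("char_name").agg(max($"char_value"))
pivoted.show()
```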

Re: frustration with field names in Dataset

2017-02-02 Thread Koert Kuipers
great its an easy fix. i will create jira and pullreq On Thu, Feb 2, 2017 at 2:13 PM, Michael Armbrust wrote: > That might be reasonable. At least I can't think of any problems with > doing that. > > On Thu, Feb 2, 2017 at 7:39 AM, Koert Kuipers

Re: frustration with field names in Dataset

2017-02-02 Thread Michael Armbrust
That might be reasonable. At least I can't think of any problems with doing that. On Thu, Feb 2, 2017 at 7:39 AM, Koert Kuipers wrote: > since a dataset is a typed object you ideally don't have to think about > field names. > > however there are operations on Dataset that

Re: HBase Spark

2017-02-02 Thread Asher Krim
Ben, That looks like a scala version mismatch. Have you checked your dep tree? Asher Krim Senior Software Engineer On Thu, Feb 2, 2017 at 1:28 PM, Benjamin Kim wrote: > Elek, > > Can you give me some sample code? I can’t get mine to work. > > import

Re: HBase Spark

2017-02-02 Thread Benjamin Kim
Elek, Can you give me some sample code? I can’t get mine to work. import org.apache.spark.sql.{SQLContext, _} import org.apache.spark.sql.execution.datasources.hbase._ import org.apache.spark.{SparkConf, SparkContext} def cat = s"""{ |"table":{"namespace":"ben", "name":"dmp_test",
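
For readers following along, a hedged sketch of how the shc (Spark-HBase connector) catalog above is typically wired up, assuming the HBaseTableCatalog API from that package; the catalog JSON below is an illustrative example, not the truncated one from the thread:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative catalog: namespace/table follow the thread, the column mapping
// is made up to show the shape of the shc catalog JSON.
def cat = s"""{
  |"table":{"namespace":"ben", "name":"dmp_test"},
  |"rowkey":"key",
  |"columns":{
    |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
    |"col1":{"cf":"d", "col":"value", "type":"string"}
  |}
|}""".stripMargin

val sc = new SparkContext(new SparkConf().setAppName("hbase-demo"))
val sqlContext = new SQLContext(sc)

// Typical shc read path: pass the catalog via HBaseTableCatalog.tableCatalog
// and use the connector's data source name as the format.
val df = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> cat))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

df.show()
```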

[ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-02 Thread Hollin Wilkins
Hey everyone, Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits about MLeap and how you can use it to build production services from your Spark-trained ML pipelines. MLeap is an open-source technology that allows Data Scientists and Engineers to deploy Spark-trained ML

Re: frustration with field names in Dataset

2017-02-02 Thread Koert Kuipers
another example is if i have a Dataset[(K, V)] and i want to repartition it by the key K. repartition requires a Column which means i am suddenly back to worrying about duplicate field names. i would like to be able to say: dataset.repartition(dataset(0)) On Thu, Feb 2, 2017 at 10:39 AM, Koert

frustration with field names in Dataset

2017-02-02 Thread Koert Kuipers
since a dataset is a typed object you ideally don't have to think about field names. however there are operations on Dataset that require you to provide a Column, like for example joinWith (and joinWith returns a strongly typed Dataset, not DataFrame). once you have to provide a Column you are
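
A minimal illustration of the friction described above, using two tuple Datasets whose fields are both named _1/_2 (the data is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fieldname-demo").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDS()
val right = Seq((1, 10.0), (2, 20.0)).toDS()

// joinWith returns a typed Dataset[((Int, String), (Int, Double))], but the
// join condition is still an untyped Column, so the duplicate "_1" names have
// to be disambiguated through the parent Datasets.
val joined = left.joinWith(right, left("_1") === right("_1"))

// Repartitioning by the key runs into the same issue: repartition takes
// Columns, so the key has to be referenced by its field name again.
val repartitioned = left.repartition(left("_1"))
```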

Re: Running a spark code on multiple machines using google cloud platform

2017-02-02 Thread Anahita Talebi
Thanks for your answer. Do you mean Amazon EMR? On Thu, Feb 2, 2017 at 2:30 PM, Marco Mistroni wrote: > U can use EMR if u want to run. On a cluster > Kr > > On 2 Feb 2017 12:30 pm, "Anahita Talebi" > wrote: > >> Dear all, >> >> I am trying

Re: Running a spark code on multiple machines using google cloud platform

2017-02-02 Thread Marco Mistroni
You can use EMR if you want to run on a cluster. Kr On 2 Feb 2017 12:30 pm, "Anahita Talebi" wrote: > Dear all, > > I am trying to run a spark code on multiple machines using submit job in > google cloud platform. > As the inputs of my code, I have a training and

Running a spark code on multiple machines using google cloud platform

2017-02-02 Thread Anahita Talebi
Dear all, I am trying to run a spark code on multiple machines using submit job in google cloud platform. As the inputs of my code, I have a training and testing dataset. When I use a small training data set (around 10 KB), the code can be run successfully on Google Cloud, while when I have a

Re: FP growth - Items in a transaction must be unique

2017-02-02 Thread Patrick Plaatje
Hi, This indicates you have duplicate products per row in your dataframe; the FP implementation only allows unique products per row, so you will need to dedupe duplicate products before running the FPGrowth algorithm. Best, Patrick From: "Devi P.V" Date:
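
A small sketch of the dedupe step Patrick recommends, assuming the RDD-based FPGrowth API and made-up item data:

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fpgrowth-demo").getOrCreate()

val transactions = spark.sparkContext.parallelize(Seq(
  Array("milk", "bread", "milk"),   // duplicate "milk" would trigger the error
  Array("bread", "butter")
))

// FPGrowth requires unique items per transaction, so dedupe each basket first.
val deduped = transactions.map(_.distinct)

val model = new FPGrowth()
  .setMinSupport(0.5)
  .setNumPartitions(4)
  .run(deduped)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + " -> " + itemset.freq)
}
```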

Re: Surprised!!!!! Spark-shell showing inconsistent results

2017-02-02 Thread Marco Mistroni
Hi, have you tried to sort the results before comparing? On 2 Feb 2017 10:03 am, "Alex" wrote: > Hi As shown below same query when ran back to back showing inconsistent > results.. > > testtable1 is Avro Serde table... > > [image: Inline image 1] > > > > hc.sql("select *

Re: Surprised!!!!! Spark-shell showing inconsistent results

2017-02-02 Thread Didac Gil
Is 1570 the value of Col1? If so, you have ordered by that column and selected only the first item. It seems that both results have the same Col1 value, therefore any of them would be a right answer to return. Right? > On 2 Feb 2017, at 11:03, Alex wrote: > > Hi As shown

RE: filters Pushdown

2017-02-02 Thread vincent gromakowski
There are some native (in the doc) and some third party (in Spark Packages https://spark-packages.org/?q=tags%3A"Data%20Sources;). Parquet is the preferred native one. Cassandra/FiloDB provides the most advanced pushdown. On 2 Feb 2017 11:23 AM, "Peter Shmukler" wrote: > Hi Vincent, >

Re: filters Pushdown

2017-02-02 Thread ayan guha
Look for the Spark Packages website. If your questions were targeted at Hive, then I think in general all answers are yes. On Thu, 2 Feb 2017 at 9:23 pm, Peter Shmukler wrote: > Hi Vincent, > > Thank you for answer. (I don’t see your answer in mailing list, so I’m > answering

RE: filters Pushdown

2017-02-02 Thread Peter Shmukler
Hi Vincent, Thank you for the answer. (I don’t see your answer in the mailing list, so I’m answering directly.) What connectors can I work with from Spark? Can you provide any link to read about it, because I see nothing in the Spark documentation? From: vincent gromakowski

Re: filters Pushdown

2017-02-02 Thread vincent gromakowski
Pushdowns depend on the source connector: join pushdown works with Cassandra only, while filter pushdown works with mainly all sources, subject to some source-specific constraints. On 2 Feb 2017 10:42 AM, "Peter Sg" wrote: > Can community help me to figure out some details about Spark: > - Does
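
One way to see which filters a source actually pushes down is to inspect the physical plan; a sketch assuming a Parquet dataset at an illustrative path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-demo").getOrCreate()
import spark.implicits._

// Illustrative path; any Parquet dataset with "amount" and "event_date"
// columns would do.
val df = spark.read.parquet("/data/events.parquet")

// Int/Long, String and date comparisons generally show up under
// "PushedFilters: [...]" in the plan for Parquet; anything the source cannot
// handle is evaluated by Spark after the scan instead.
df.filter($"amount" > 100 && $"event_date" >= "2017-01-01")
  .explain(true)
```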

Re: Closing resources in the executor

2017-02-02 Thread Appu K
https://mid.mail-archive.com/search?l=user@spark.apache.org=subject:%22Executor+shutdown+hook+and+initialization%22=newest=1 I see this thread where it is mentioned that per-partition resource management is recommended over global state (within an executor). What would be the way to achieve this in

Surprised!!!!! Spark-shell showing inconsistent results

2017-02-02 Thread Alex
Hi, as shown below, the same query when run back to back is showing inconsistent results. testtable1 is an Avro SerDe table... [image: Inline image 1] hc.sql("select * from testtable1 order by col1 limit 1").collect; res14: Array[org.apache.spark.sql.Row] =
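
A short illustration of why the two runs can legitimately differ, and how to make the comparison deterministic (the table and column names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orderby-demo").getOrCreate()
import spark.implicits._

val df = Seq((1570, "rowA"), (1570, "rowB"), (2000, "rowC")).toDF("col1", "payload")

// "order by col1 limit 1" only constrains col1: both 1570 rows are valid
// answers, so back-to-back runs may return either of them.
df.orderBy($"col1").limit(1).show()

// Breaking ties with the remaining columns makes the result deterministic,
// as does sorting the collected rows before comparing two runs.
df.orderBy($"col1", $"payload").limit(1).show()
val sortedRows = df.orderBy($"col1").collect().sortBy(_.mkString("|"))
```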

filters Pushdown

2017-02-02 Thread Peter Sg
Can the community help me to figure out some details about Spark: Does Spark support filter pushdown for the types Int/Long, DateTime, and String? Does Spark support pushdown of join operations for partitioned tables (in case the join condition includes partitioning

Re: Is it okay to run Hive Java UDFS in Spark-sql. Anybody's still doing it?

2017-02-02 Thread Jörn Franke
There are many performance aspects here, which may be related not only to the UDF itself, but also to the configuration of the platform, data, etc. You seem to have a performance problem with your UDFs. Maybe you can elaborate on 1) what data you process (format, etc.) 2) what you try to analyse 3) how you

Is it okay to run Hive Java UDFS in Spark-sql. Anybody's still doing it?

2017-02-02 Thread Alex
Hi Team, Do you really think that if we make Hive Java UDFs run on spark-sql it will make a performance difference? Is anybody here actually doing it, i.e. converting Hive UDFs to run on spark-sql? What would be your approach if asked to make a Hive Java UDFs project run on spark-sql? Would you run
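
For context, existing Hive Java UDFs can usually be registered in Spark SQL without a rewrite; a hedged sketch with a hypothetical UDF class and jar path, assuming Spark 2.x with Hive support enabled:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-udf-demo")
  .enableHiveSupport()   // needed so Hive UDF classes can be registered
  .getOrCreate()

// Hypothetical class and jar; substitute the real UDF implementation.
spark.sql("ADD JAR /path/to/my-hive-udfs.jar")
spark.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.hive.MyUDF'")

// The Hive UDF is now callable from Spark SQL like any built-in function.
spark.sql("SELECT my_udf(col1) FROM testtable1 LIMIT 10").show()
```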