Re: Spark performance testing

2016-07-08 Thread Andrew Ehrlich
Yea, I'm looking for any personal experiences people have had with tools like these. > On Jul 8, 2016, at 8:57 PM, charles li wrote: > > Hi, Andrew, I've got lots of materials when asking google for "spark > performance test" > >

Re: Spark performance testing

2016-07-08 Thread charles li
Hi, Andrew, I've got lots of materials when asking google for "*spark performance test*" - https://github.com/databricks/spark-perf - https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf -

Re: Is there a way to dynamic load files [ parquet or csv ] in the map function?

2016-07-08 Thread Deepak Sharma
Yes. You can do something like this: .map(x => mapfunction(x)) Thanks Deepak On 9 Jul 2016 9:22 am, "charles li" wrote: > > hi, guys, is there a way to dynamically load files within the map function? > > i.e. > > Can I code as below: > > thanks a lot. > > -- >
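
A minimal sketch of the pattern Deepak suggests, assuming the file has to be read on the executor: SparkContext cannot be used inside the closure, so the helper opens the file directly via the Hadoop FileSystem API. The function name, path layout and surrounding rdd are hypothetical.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

def mapfunction(name: String): String = {
  // Runs on the executor: open the file directly, not through SparkContext
  val fs = FileSystem.get(new Configuration())
  val in = fs.open(new Path(s"/data/lookup/$name"))  // hypothetical path layout
  try Source.fromInputStream(in).mkString finally in.close()
}

val loaded = rdd.map(x => mapfunction(x))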

Is there a way to dynamic load files [ parquet or csv ] in the map function?

2016-07-08 Thread charles li
hi, guys, is there a way to dynamically load files within the map function? i.e. Can I code as below: thanks a lot. -- *___* Quant | Engineer | Boy *___* *blog*: http://litaotao.github.io *github*: www.github.com/litaotao

Spark performance testing

2016-07-08 Thread Andrew Ehrlich
Hi group, What solutions are people using to do performance testing and tuning of spark applications? I have been doing a pretty manual technique where I lay out an Excel sheet of various memory settings and caching parameters and then execute each one by hand. It’s pretty tedious though, so

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-08 Thread mylisttech
Hi Mich, Would it be on YouTube, post session? - Harmeet On Jul 7, 2016, at 3:07, Mich Talebzadeh wrote: > Dear forum members > > I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, > your mileage varies" in Future of Data: London >

Re: DataFrame Min By Column

2016-07-08 Thread Xinh Huynh
Hi Pedro, I could not think of a way using an aggregate. It's possible with a window function, partitioned on user and ordered by time: // Assuming "df" holds your dataframe ... import org.apache.spark.sql.functions._ import org.apache.spark.sql.expressions.Window val wSpec =
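
A possible completion of the window approach Xinh outlines; the column names "user" and "time" and the dataframe df are assumptions, not taken from the thread.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Rank events per user by time; the earliest event per user gets row number 1
val wSpec = Window.partitionBy("user").orderBy("time")
val firstEvents = df
  .withColumn("rn", row_number().over(wSpec))
  .where(col("rn") === 1)
  .drop("rn")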

Broadcast hash join implementation in Spark

2016-07-08 Thread Lalitha MV
Hi, 1. What implementation is used for the hash join -- is it classic hash join or Hybrid grace hash join? 2. If the hash table does not fit in memory, does it spill or does it fail? Are there parameters to control this (for example to set the percentage of hash table which can spill etc.) 3. Is

Isotonic Regression, run method overloaded Error

2016-07-08 Thread dsp
Hi, I am trying to perform Isotonic Regression on a data set with 9 features and a label. When I run the algorithm similar to the way mentioned on the MLlib page, I get the error: overloaded method value run with alternatives: (input:
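
For reference, a minimal sketch of the MLlib call in question: IsotonicRegression.run expects an RDD of (label, feature, weight) triples with a single feature, which is usually what this overload error points to when a tuple with several features is passed. The data preparation below is an assumption.

import org.apache.spark.mllib.regression.{IsotonicRegression, LabeledPoint}
import org.apache.spark.rdd.RDD

// MLlib isotonic regression is univariate: (label, feature, weight)
val training: RDD[(Double, Double, Double)] =
  labeledData.map(p => (p.label, p.features(0), 1.0))  // keep a single feature

val model = new IsotonicRegression().setIsotonic(true).run(training)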

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-08 Thread Rabin Banerjee
Yes, I mean dump it in HDFS as a file, via yarn cluster mode. On Jul 8, 2016 3:10 PM, "Yu Wei" wrote: > How could I dump data into a text file? Writing to HDFS or some other approach? > > > Thanks, > > Jared > -- > *From:* Rabin Banerjee
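
For illustration, one way a streaming job could dump each batch to text files on HDFS; a sketch only, with a made-up output path.

dstream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile(s"hdfs:///tmp/debug/batch-${time.milliseconds}")  // hypothetical path
  }
}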

DataFrame Min By Column

2016-07-08 Thread Pedro Rodriguez
Is there a way to on a GroupedData (from groupBy in DataFrame) to have an aggregate that returns column A based on a min of column B? For example, I have a list of sites visited by a given user and I would like to find the event with the minimum time (first event) Thanks, -- Pedro Rodriguez PhD

Re: Unresponsive Spark Streaming UI in YARN cluster mode - 1.5.2

2016-07-08 Thread Shixiong(Ryan) Zhu
Hey Tom, Could you provide all blocked threads? Perhaps due to some potential deadlock. On Fri, Jul 8, 2016 at 10:30 AM, Ellis, Tom (Financial Markets IT) < tom.el...@lloydsbanking.com.invalid> wrote: > Hi There, > > > > We’re currently using HDP 2.3.4, Spark 1.5.2 with a Spark Streaming job in

Re: Random Forest Classification

2016-07-08 Thread Bryan Cutler
Hi Rich, I looked at the notebook and it seems like you are fitting the StringIndexer and VectorIndexer to only the training data, and it should be the entire data set. So if the training data does not include all of the labels and an unknown label appears in the test data during evaluation,
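
A sketch of the fix Bryan describes: fit the indexers on the whole dataset so every label and category is seen, then split. Column names and maxCategories are placeholders.

import org.apache.spark.ml.feature.{StringIndexer, VectorIndexer}

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(fullData)                 // fit on the full dataset, not just the training split

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(fullData)

val Array(training, test) = fullData.randomSplit(Array(0.7, 0.3))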

Unresponsive Spark Streaming UI in YARN cluster mode - 1.5.2

2016-07-08 Thread Ellis, Tom (Financial Markets IT)
Hi There, We're currently using HDP 2.3.4, Spark 1.5.2 with a Spark Streaming job in YARN Cluster mode consuming from a high volume Kafka topic. When we try to access the Spark Streaming UI on the application master, it is unresponsive/hangs or sometimes comes back with connection refused. It

can I use ExectorService in my driver? was: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Andy Davidson
Hi Ewan, Currently I split my dataframe into n smaller dataframes and call write().json("S3://") Each data frame becomes a single S3 object. I assume for your solution to work I would need to repartition(1) each of the smaller sets so that they are written as a single s3 object. I am also
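
A sketch of what an executor service in the driver could look like here: each write is a blocking Spark action, but actions can be issued from separate driver threads because the scheduler is thread-safe. The dataframe collection and the S3 prefix are hypothetical.

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

// Fire off all writes concurrently, then wait for them to finish
val writes = smallerDFs.zipWithIndex.map { case (df, i) =>
  Future { df.write.json(s"s3://my-bucket/output/part-$i") }  // hypothetical bucket/prefix
}
Await.result(Future.sequence(writes), Duration.Inf)
ec.shutdown()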

Iterate over columns in sql.dataframe

2016-07-08 Thread Pasquinell Urbani
Hi all, I need to apply QuantileDiscretizer() over a 16-column sql.dataframe. What is the most efficient way to apply a function to each column? Do I need to iterate over the columns? What is the best way to do this? Thank you all.
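
A sketch of one way to avoid hand-written loops: fold over the column names and apply QuantileDiscretizer to each. The bucket count and output column naming are assumptions.

import org.apache.spark.ml.feature.QuantileDiscretizer

val discretized = df.columns.foldLeft(df) { (current, colName) =>
  new QuantileDiscretizer()
    .setInputCol(colName)
    .setOutputCol(s"${colName}_bucket")
    .setNumBuckets(10)
    .fit(current)
    .transform(current)
}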

Spark Terasort Help

2016-07-08 Thread Punit Naik
Hi Guys I am trying to run spark terasort benchmark provided by ehiggs on github. Terasort on 1 gb, 10 gb and 100gb works fine. But when it comes to 1000 gb, the program seems to run into problems. The 1000 gb terasort actually completes on single-node

RE: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Ewan Leith
Writing (or reading) small files from spark to s3 can be seriously slow. You'll get much higher throughput by doing a df.foreachPartition(partition => ...) and inside each partition, creating an aws s3 client then doing a partition.foreach and uploading the files using that s3 client with its
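
A rough sketch of the pattern Ewan describes, creating one AWS SDK client per partition on the executor; the bucket, key layout and JSON conversion are assumptions, not his exact code.

import java.io.ByteArrayInputStream
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectMetadata

df.toJSON.foreachPartition { partition =>
  val s3 = new AmazonS3Client()  // one client per partition, created on the executor
  partition.foreach { json =>
    val bytes = json.getBytes("UTF-8")
    val meta = new ObjectMetadata()
    meta.setContentLength(bytes.length)
    s3.putObject("my-bucket", s"output/${java.util.UUID.randomUUID()}.json",
      new ByteArrayInputStream(bytes), meta)
  }
}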

Re: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Cody Koeninger
Maybe obvious, but what happens when you change the s3 write to a println of all the data? That should identify whether it's the issue. count() and read.json() will involve additional tasks (run through the items in the rdd to count them, likewise to infer the schema) but for 300 records that

Re: Spark as sql engine on S3

2016-07-08 Thread Mich Talebzadeh
You can take two approaches here. Use Hive as it is and replace the Hive execution engine with Spark. You can use beeline with the Hive thrift server to access your Hive tables; beeline connects to the thrift server (either Hive or Spark). If you use the Spark thrift server with beeline then you are going to

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-07-08 Thread Chanh Le
Hi Gene, Thanks for your support. I agree with you that it is because of the number of executors, but having many parquet files hurts read performance, so I need a way to improve that. The way I work around it is df.coalesce(1) .write.mode(SaveMode.Overwrite).partitionBy("network_id")
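
A possible completion of the truncated snippet (the output path is made up):

import org.apache.spark.sql.SaveMode

df.coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("network_id")
  .parquet("alluxio://master:19998/etl/output")  // hypothetical Alluxio path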

Re: Why is KafkaUtils.createRDD offsetRanges an Array rather than a Seq?

2016-07-08 Thread Cody Koeninger
Yeah, it's a reasonable lowest common denominator between java and scala, and what's passed to that convenience constructor is actually what's used to construct the class. FWIW, in the 0.10 direct stream api when there's unavoidable wrapping / conversion anyway (since the underlying class takes a

Re: Memory grows exponentially

2016-07-08 Thread Cody Koeninger
Just as an offhand guess, are you doing something like updateStateByKey without expiring old keys? On Fri, Jul 8, 2016 at 2:44 AM, Jörn Franke wrote: > Memory fragmentation? Quite common with in-memory systems. > >> On 08 Jul 2016, at 08:56, aasish.kumar
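
A sketch of the expiry Cody alludes to: returning None from the update function removes the key from the state. The timeout and value types here are assumptions.

val updateFunc = (values: Seq[Long], state: Option[(Long, Long)]) => {
  val now = System.currentTimeMillis()
  val (count, lastSeen) = state.getOrElse((0L, now))
  if (values.isEmpty && now - lastSeen > 60 * 60 * 1000) {
    None  // expire keys that have been idle for over an hour
  } else {
    Some((count + values.sum, if (values.isEmpty) lastSeen else now))
  }
}

val counts = pairStream.updateStateByKey(updateFunc)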

Re: Simultaneous spark Jobs execution.

2016-07-08 Thread Mich Talebzadeh
If you are running in Local mode, then you can submit many jobs. As long as your hardware has resources to do multiple jobs, there won't be any dependency. In other words, each app (spark-submit) will run in its own JVM unaware of others. Local mode is good for testing. HTH Dr Mich Talebzadeh

RE: Spark with HBase Error - Py4JJavaError

2016-07-08 Thread Puneet Tripathi
Hi Ram, Thanks very much it worked. Puneet From: ram kumar [mailto:ramkumarro...@gmail.com] Sent: Thursday, July 07, 2016 6:51 PM To: Puneet Tripathi Cc: user@spark.apache.org Subject: Re: Spark with HBase Error - Py4JJavaError Hi Puneet, Have you tried appending --jars

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-07-08 Thread Gene Pang
Hi Chanh, You should be able to set the Alluxio block size with: sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "256mb") I think you have many parquet files because you have many Spark executors writing out their partition of the files. Hope that helps, Gene On Sun, Jul

Re: Is the operation inside foreachRDD supposed to be blocking?

2016-07-08 Thread Mikael Ståldal
Yes, this is a stupid example. In my real code the processItem method is using some third-party library which does things asynchronously and returns a Future. On Fri, Jul 8, 2016 at 3:11 PM, Sean Owen wrote: > You can write this code. I don't think it will do anything

Re: Is the operation inside foreachRDD supposed to be blocking?

2016-07-08 Thread Sean Owen
You can write this code. I don't think it will do anything useful because you're executing asynchronously but then just blocking waiting for completion. It seems the same as just doing all the work in processItems() directly. On Fri, Jul 8, 2016 at 1:56 PM, Mikael Ståldal

Re: Simultaneous spark Jobs execution.

2016-07-08 Thread Jacek Laskowski
On 8 Jul 2016 2:03 p.m., "Mazen" wrote: > > Does Spark handle simultaneous execution of jobs within an application Yes. Run as many Spark jobs as you want and Spark will queue them given the CPU and RAM available for you in the cluster. > job execution is blocking i.e. a
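
A tiny illustration of the point: separate driver threads can each submit a Spark job, and the thread-safe scheduler runs them concurrently if resources allow. The work itself is a placeholder.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Three independent jobs submitted from three driver threads
val jobs = (1 to 3).map { i =>
  Future { sc.parallelize(1 to 1000000).map(_ * i).count() }
}
val results = Await.result(Future.sequence(jobs), Duration.Inf)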

spark logging best practices

2016-07-08 Thread vimal dinakaran
Hi, http://stackoverflow.com/questions/29208844/apache-spark-logging-within-scala What is the best way to capture spark logs without getting the task not serializable error? The above link has various workarounds. Also, is there a way to dynamically set the log level when the application is running
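
One commonly suggested workaround is a transient lazy logger, so the closure never tries to serialize it; a sketch, not an official recommendation. For the log level, SparkContext.setLogLevel can change it at runtime from the driver.

import org.apache.log4j.Logger

trait LazyLogging extends Serializable {
  // @transient + lazy: the logger is never shipped with the closure,
  // it is re-created on each executor on first use
  @transient protected lazy val log: Logger = Logger.getLogger(getClass.getName)
}

// Changing the level at runtime, from the driver:
sc.setLogLevel("WARN")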

Re: Is the operation inside foreachRDD supposed to be blocking?

2016-07-08 Thread Mikael Ståldal
I am not sure I fully understand your answer. Is this code correct? def main() { KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](streamingContext, configs, topics).foreachRDD { rdd => Await.ready(processItems(rdd.collect()), Duration.Inf) } } def
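
A reformatted sketch of the code in the mail, with the imports it would need; processItems is Mikael's own method and is left unimplemented here, and its argument type is an assumption.

import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.streaming.kafka.KafkaUtils
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

def main(): Unit = {
  KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
      streamingContext, configs, topics)
    .foreachRDD { rdd =>
      // Block the batch until the asynchronous processing completes
      Await.ready(processItems(rdd.collect()), Duration.Inf)
    }
}

def processItems(items: Array[(String, Array[Byte])]): Future[Unit] = ???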

Re: Bug about reading parquet files

2016-07-08 Thread Sea
My spark version is 1.6.1. == Parsed Logical Plan == Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#112L] +- Limit 1 +- Aggregate [(count(1),mode=Complete,isDistinct=false) AS events#0L] +- Filter ((concat(year#1,month#2,day#3,hour#4) = 2016070700) && (appid#5 = 6))

Simultaneous spark Jobs execution.

2016-07-08 Thread Mazen
Does Spark handle simultaneous execution of jobs within an application, or is job execution blocking, i.e. a new job cannot be initiated until the previous one commits? What does it mean that "Spark’s scheduler is fully thread-safe"? Thanks. -- View this message in context:

RangePartitioning

2016-07-08 Thread tan shai
Hi, Can anyone explain to me the class RangePartitioning " https://github.com/apache/spark/blob/d5911d1173fe0872f21cae6c47abf8ff479345a4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala " case class RangePartitioning(ordering: Seq[SortOrder],
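
For reference, the truncated declaration continues, at that point in the code base, roughly as follows; rows whose ordering keys fall into the same range end up in the same partition, which is how a global orderBy is implemented.

case class RangePartitioning(ordering: Seq[SortOrder], numPartitions: Int)
  extends Expression with Partitioning with Unevaluable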

Re: Is the operation inside foreachRDD supposed to be blocking?

2016-07-08 Thread Sean Owen
It's no different than any other operation on an RDD. A transformation doesn't actually do anything by itself, so does not block. An action triggers computation and blocks until the action completes. You can wait for it with a Future, sure. On Fri, Jul 8, 2016 at 10:43 AM, Mikael Ståldal

Re: Why is KafkaUtils.createRDD offsetRanges an Array rather than a Seq?

2016-07-08 Thread Sean Owen
Java-friendliness is the usual reason, though I don't know if that's the reason here. On Fri, Jul 8, 2016 at 10:42 AM, Mikael Ståldal wrote: > Is there any particular reason for the offsetRanges parameter to > KafkaUtils.createRDD is Array[OffsetRange]? Why not

Re: Spark as sql engine on S3

2016-07-08 Thread Ashok Kumar
Hi, As I said, we have been using Hive as our SQL engine for the datasets but we are storing the data externally in Amazon S3. Now you suggested the Spark thrift server. Started the Spark thrift server on port 10001 and I have used beeline to access the thrift server. Connecting to

Is the operation inside foreachRDD supposed to be blocking?

2016-07-08 Thread Mikael Ståldal
In a Spark Streaming job, is the operation inside foreachRDD supposed to be synchronous / blocking? What if you do some asynchronous operation which returns a Future? Are you then supposed to do Await on that Future? -- *Mikael Ståldal* Senior software developer *Magine TV*

Why is KafkaUtils.createRDD offsetRanges an Array rather than a Seq?

2016-07-08 Thread Mikael Ståldal
Is there any particular reason for the offsetRanges parameter to KafkaUtils.createRDD is Array[OffsetRange]? Why not Seq[OffsetRange]? http://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/kafka/KafkaUtils$.html

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-08 Thread Yu Wei
How could I dump data into a text file? Writing to HDFS or some other approach? Thanks, Jared From: Rabin Banerjee Sent: Thursday, July 7, 2016 7:04:29 PM To: Yu Wei Cc: Mich Talebzadeh; user; Deng Ching-Mallete Subject: Re: Is that

How to improve the performance for writing a data frame to a JDBC database?

2016-07-08 Thread Mungeol Heo
Hello, I am trying to write a data frame to a JDBC database, like SQL server, using spark 1.6.0. The problem is "write.jdbc(url, table, connectionProperties)" is too slow. Is there any way to improve the performance/speed? e.g. options like partitionColumn, lowerBound, upperBound, numPartitions
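
The options listed (partitionColumn, lowerBound, upperBound, numPartitions) are read-side options and do not affect writes. Two things commonly tried for writes are more partitions (more concurrent insert connections) and a larger JDBC batch size, sketched below; note that the "batchsize" write option was only added in later Spark releases, so treat it as something to verify on 1.6.0.

import java.util.Properties

val props = new Properties()
props.setProperty("user", "myuser")  // placeholder credentials
props.setProperty("password", "mypassword")
props.setProperty("batchsize", "10000")  // write batch size; supported in newer Spark versions

// More partitions mean more concurrent insert connections
df.repartition(8).write.jdbc(url, table, props)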

Re: Graphframe Error

2016-07-08 Thread Felix Cheung
I ran it with Python 2. On Thu, Jul 7, 2016 at 4:13 AM -0700, "Arun Patel" > wrote: I have tied this already. It does not work. What version of Python is needed for this package? On Wed, Jul 6, 2016 at 12:45 AM, Felix Cheung

Re: Bug about reading parquet files

2016-07-08 Thread Cheng Lian
What's the Spark version? Could you please also attach result of explain(extended = true)? On Fri, Jul 8, 2016 at 4:33 PM, Sea <261810...@qq.com> wrote: > I have a problem reading parquet files. > sql: > select count(1) from omega.dwd_native where year='2016' and month='07' > and day='05' and

Bug about reading parquet files

2016-07-08 Thread Sea
I have a problem reading parquet files. sql: select count(1) from omega.dwd_native where year='2016' and month='07' and day='05' and hour='12' and appid='6'; The hive partition is (year,month,day,appid). There are only two tasks, and it will list all directories in my table, not only

Re: Extend Dataframe API

2016-07-08 Thread Rishi Mishra
Or, you can extend SQLContext to add your plans. Not sure if it fits your requirement, but answered to highlight an option. Regards, Rishitesh Mishra, SnappyData (http://www.snappydata.io/) https://in.linkedin.com/in/rishiteshmishra On Thu, Jul 7, 2016 at 8:39 PM, tan shai

Re: Multiple aggregations over streaming dataframes

2016-07-08 Thread Arnaud Bailly
Thanks for your answers. I know Kafka's model but I would rather like to avoid having to setup both Spark and Kafka to handle my use case. I wonder if it might be possible to handle that using Spark's standard streams ? -- Arnaud Bailly twitter: abailly skype: arnaud-bailly linkedin:

Re: Processing json document

2016-07-08 Thread Jörn Franke
You are correct, although I think there is a way to properly identify the json even in case it is split (I think this should be supported by the json parser). Nevertheless - as an exchange format in Big Data platforms you should use Avro, and for tabular analytics ORC or Parquet... Nevertheless,

Re: Memory grows exponentially

2016-07-08 Thread Jörn Franke
Memory fragmentation? Quite common with in-memory systems. > On 08 Jul 2016, at 08:56, aasish.kumar wrote: > > Hello everyone: > > I have been facing a problem associated with Spark Streaming memory. > > I have been running two Spark Streaming jobs concurrently. The jobs

Memory grows exponentially

2016-07-08 Thread aasish.kumar
Hello everyone: I have been facing a problem associated with Spark Streaming memory. I have been running two Spark Streaming jobs concurrently. The jobs read data from Kafka with a batch interval of 1 minute, perform aggregation, and sink the computed data to MongoDB using the stratio-mongodb

Re: Re: Re: how to select first 50 values of each group after group by?

2016-07-08 Thread luohui20001
Thank you Anton, I got my problem solved with the code below: val hivetable = hc.sql("select * from house_sale_pv_location") val overLocation = Window.partitionBy(hivetable.col("lp_location_id")) val sortedDF = hivetable.withColumn("rowNumber",
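
A possible completion of the truncated snippet: number the rows within each location and keep the first 50. The ordering column used here is an assumption.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val hivetable = hc.sql("select * from house_sale_pv_location")
val overLocation = Window.partitionBy(hivetable.col("lp_location_id"))
                         .orderBy(hivetable.col("pv").desc)  // assumed ordering column
val sortedDF = hivetable.withColumn("rowNumber", row_number().over(overLocation))
val top50 = sortedDF.where(col("rowNumber") <= 50).drop("rowNumber")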