Spark Streaming with Confluent

2017-12-13 Thread Arkadiusz Bicz
Hi, I am trying to test Spark Streaming 2.2.0 with Confluent 3.3.0 and I get a lot of errors during compilation. This is my sbt: lazy val sparkstreaming = (project in file(".")) .settings( name := "sparkstreaming", organization := "org.arek", version := "0.1-SNAPSHOT", scalaVersion
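
A minimal build.sbt sketch for this combination (not the original build: the scalaVersion, the Confluent resolver and the dependency list below are assumptions for Spark 2.2.0 with Confluent 3.3.0):

    lazy val sparkstreaming = (project in file("."))
      .settings(
        name := "sparkstreaming",
        organization := "org.arek",
        version := "0.1-SNAPSHOT",
        scalaVersion := "2.11.11",                                  // assumed
        resolvers += "confluent" at "http://packages.confluent.io/maven/",
        libraryDependencies ++= Seq(
          "org.apache.spark" %% "spark-core"                 % "2.2.0" % "provided",
          "org.apache.spark" %% "spark-streaming"            % "2.2.0" % "provided",
          "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0",
          "io.confluent"      % "kafka-avro-serializer"      % "3.3.0"
        )
      )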

Re: is it ok to have multiple sparksession's in one spark structured streaming app?

2017-09-08 Thread Arkadiusz Bicz
You don't need multiple Spark sessions to have more than one stream working; from a maintenance and reliability perspective it is not a good idea. On Thu, Sep 7, 2017 at 2:40 AM, kant kodali wrote: > Hi All, > > I am wondering if it is ok to have multiple sparksession's in
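
A minimal sketch of two independent streaming queries sharing one SparkSession (assumes a spark-shell session where spark is already in scope; the rate source is only a stand-in input):

    val q1 = spark.readStream.format("rate").load()
      .writeStream.format("console").queryName("stream1").start()

    val q2 = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
      .writeStream.format("console").queryName("stream2").start()

    // Block until any of the running queries terminates
    spark.streams.awaitAnyTermination()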

Re: Cloudera 5.8.0 and spark 2.1.1

2017-05-17 Thread Arkadiusz Bicz
It is working fine, but it is not supported by Cloudera. On May 17, 2017 1:30 PM, "issues solution" wrote: > Hi, > is it possible to use a prebuilt version of Spark 2.1 inside Cloudera 5.8, > where Scala is 2.1.0 not 2.1.1 and Java is 1.7 not 1.8? > > Why? > >

Support for decimal separator (comma or period) in spark 2.1

2017-02-23 Thread Arkadiusz Bicz
: spark.read.option("sep",";").option("header", "true").option("inferSchema", "true").format("csv").load("nonuslocalized.csv") If not, should I create a JIRA ticket for this? I can work on a solution if one is not available. Best Regards, Arkadiusz Bicz
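
A possible workaround sketch if no such option is available: read the localized column as a string and normalize the comma decimal separator manually. The file name comes from the snippet above; the column name "amount" is hypothetical.

    import org.apache.spark.sql.functions.{col, regexp_replace}

    val raw = spark.read
      .option("sep", ";")
      .option("header", "true")
      .format("csv")
      .load("nonuslocalized.csv")

    // Replace the comma decimal separator and cast to double ("amount" is a hypothetical column)
    val parsed = raw.withColumn("amount",
      regexp_replace(col("amount"), ",", ".").cast("double"))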

Re: Processing millions of messages in milliseconds -- Architecture guide required

2016-04-19 Thread Arkadiusz Bicz
Sorry, I've found one error: if you do NOT need any relational processing of your messages (based on historical data, or joining with other messages) and message processing is quite independent, Kafka plus Spark Streaming could be overkill. On Tue, Apr 19, 2016 at 1:54 PM, Arkadiusz Bicz

Re: Processing millions of messages in milliseconds -- Architecture guide required

2016-04-19 Thread Arkadiusz Bicz
on it do access to your cache and disk. For caching, Alluxio looks most promising to me. BR, Arkadiusz Bicz On Tue, Apr 19, 2016 at 6:01 AM, Deepak Sharma <deepakmc...@gmail.com> wrote: > Hi all, > I am looking for an architecture to ingest 10 mils of messages in the micro > bat

YARN vs Standalone Spark Usage in production

2016-04-14 Thread Arkadiusz Bicz
Hello, Are there any statistics regarding YARN vs Standalone Spark usage in production? I would like to choose the most supported and most widely used technology in production for our project. BR, Arkadiusz Bicz

Re: How to update data saved as parquet in hdfs using Dataframes

2016-02-17 Thread Arkadiusz Bicz
Hi, HDFS is append-only, so to modify data you need to read it and write it out to another location. On Wed, Feb 17, 2016 at 2:45 AM, SRK wrote: > Hi, > > How do I update data saved as Parquet in hdfs using dataframes? If I use > SaveMode.Append, it just seems to append the data
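
A minimal sketch of that read-modify-write pattern (the paths and the updated column are assumptions): since HDFS files are immutable, "updating" Parquet data means reading it, transforming it, and writing the result to a new location.

    import org.apache.spark.sql.functions.lit

    val events  = spark.read.parquet("hdfs:///data/events")              // assumed input path
    val updated = events.withColumn("status", lit("processed"))          // hypothetical change
    updated.write.mode("overwrite").parquet("hdfs:///data/events_v2")    // separate output location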

Re: Memory problems and missing heartbeats

2016-02-16 Thread Arkadiusz Bicz
I had a problem similar to #2 when I used a lot of caching and then did shuffling. It looks like when I cached too much there was not enough space for other Spark tasks and it just hung. You can try to cache less and see if that improves things; executor logs also help a lot (watch out for logs with
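
One related mitigation, not from the original reply (a sketch assuming a spark-shell session and an assumed input path): cache with a storage level that can spill to disk instead of pinning everything in memory, and release the cache when it is no longer needed.

    import org.apache.spark.storage.StorageLevel

    val df     = spark.read.parquet("/data/input")            // assumed input
    val cached = df.persist(StorageLevel.MEMORY_AND_DISK)     // can spill to disk under memory pressure
    cached.count()                                            // materialize before the shuffle-heavy stages
    // ... run the jobs that reuse `cached` ...
    cached.unpersist()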

Re: best practices? spark streaming writing output detecting disk full error

2016-02-12 Thread Arkadiusz Bicz
Hi, You need good monitoring tools to send you alarms about disk, network or application errors, but I think that is general DevOps work, not very specific to Spark or Hadoop. BR, Arkadiusz Bicz https://www.linkedin.com/in/arkadiuszbicz On Thu, Feb 11, 2016 at 7:09 PM, Andy Davidson

Re: How to parallel read files in a directory

2016-02-12 Thread Arkadiusz Bicz
way lot of data. Also, when using DataFrames there is a huge overhead from caching file information, as described in https://issues.apache.org/jira/browse/SPARK-11441 BR, Arkadiusz Bicz https://www.linkedin.com/in/arkadiuszbicz On Thu, Feb 11, 2016 at 7:24 PM, Jakob Odersky <ja...@odersky.com> wr

Re: best practices? spark streaming writing output detecting disk full error

2016-02-12 Thread Arkadiusz Bicz
github.com/pabloa/grafana-alerts. I have not used grafana-alerts but it looks promising. BR, Arkadiusz Bicz On Fri, Feb 12, 2016 at 4:38 PM, Andy Davidson <a...@santacruzintegration.com> wrote: > Hi Arkadiusz > > Do you have any suggestions? > > As an engineer I think when

Re: Generic Dataset Aggregator

2016-01-26 Thread Arkadiusz Bicz
= new AggregateResults().toColumn import sqlContext.implicits._ val dsResults = Seq(ResultSmallA("1", "1", Array[Double](1.0,2.0)), ResultSmallA("1", "1", Array[Double](1.0,2.0)) ).toDS() dsResults.groupBy(_.tradeId).agg(sumRes) Best Regards, Arkadiusz Bicz https
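
A fuller sketch of a typed aggregator, written against the Spark 2.x Aggregator API rather than the 1.6 API used in the snippet; the case class fields and the element-wise sum are assumptions modelled on the code above (assumes a spark-shell session so spark is in scope).

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator
    import spark.implicits._

    case class ResultSmallA(tradeId: String, leg: String, values: Array[Double])

    // Element-wise sum of the `values` arrays within each group
    object SumValues extends Aggregator[ResultSmallA, Array[Double], Array[Double]] {
      def zero: Array[Double] = Array.empty[Double]
      def reduce(buf: Array[Double], r: ResultSmallA): Array[Double] = merge(buf, r.values)
      def merge(a: Array[Double], b: Array[Double]): Array[Double] =
        if (a.isEmpty) b else if (b.isEmpty) a else a.zip(b).map { case (x, y) => x + y }
      def finish(buf: Array[Double]): Array[Double] = buf
      def bufferEncoder: Encoder[Array[Double]] = Encoders.kryo[Array[Double]]
      def outputEncoder: Encoder[Array[Double]] = Encoders.kryo[Array[Double]]
    }

    val ds = Seq(ResultSmallA("1", "1", Array(1.0, 2.0)),
                 ResultSmallA("1", "1", Array(1.0, 2.0))).toDS()
    ds.groupByKey(_.tradeId).agg(SumValues.toColumn).collect()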

Re: best practice : how to manage your Spark cluster ?

2016-01-21 Thread Arkadiusz Bicz
-with-graphite-and-grafana/ BR, Arkadiusz Bicz On Thu, Jan 21, 2016 at 5:33 AM, charles li <charles.up...@gmail.com> wrote: > I've put a thread before: pre-install 3-party Python package on spark > cluster > > currently I use Fabric to manage my cluster , but it's not enou

Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Arkadiusz Bicz
Why does it need to be only one file? Spark does a good job of writing to many files. On Fri, Jan 15, 2016 at 7:48 AM, Patrick McGloin wrote: > Hi, > > I would like to repartition / coalesce my data so that it is saved into one > Parquet file per partition. I would also like
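
If a single file per partition really is required, one common technique (a sketch, not the original thread's conclusion; the DataFrame, column and path names are assumptions) is to repartition by the partition column first, so each partition directory is written by a single task:

    import org.apache.spark.sql.functions.col

    df.repartition(col("event_date"))     // `df` and "event_date" are assumed
      .write
      .partitionBy("event_date")
      .parquet("/tmp/output")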

DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-14 Thread Arkadiusz Bicz
Hi, What is the proper configuration for saving Parquet partitions with a large number of repeated keys? In the code below I load 500 million rows of data and partition on a column with not many distinct values, using spark-shell with 30g per executor and driver and 3 executor cores

Re: How to make Dataset api as fast as DataFrame

2016-01-13 Thread Arkadiusz Bicz
Hi, Including the query plan. DataFrame:
== Physical Plan ==
SortBasedAggregate(key=[agreement#23], functions=[(MaxVectorAggFunction(values#3),mode=Final,isDistinct=false)], output=[agreement#23,maxvalues#27])
+- ConvertToSafe
   +- Sort [agreement#23 ASC], false, 0
      +- TungstenExchange

How to make Dataset api as fast as DataFrame

2016-01-13 Thread Arkadiusz Bicz
Hi, I have done some performance tests, repeating execution with different numbers of executors and memory on a YARN cluster running Spark 1.6.0 (the cluster contains 6 large nodes). I found Dataset joinWith or cogroup to be 3 to 5 times slower than a broadcast join in DataFrame; how to
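
A minimal sketch of the two approaches being compared (assumes bigDF/smallDF and bigDS/smallDS are already loaded, and uses the "agreement" key from the sibling message's query plan):

    import org.apache.spark.sql.functions.broadcast

    // DataFrame join with an explicit broadcast hint on the small side
    val dfJoined = bigDF.join(broadcast(smallDF), Seq("agreement"))

    // Typed Dataset joinWith, which here goes through a regular (shuffle-based) join
    val dsJoined = bigDS.joinWith(smallDS, bigDS("agreement") === smallDS("agreement"))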

Re: Job History Logs for spark jobs submitted on YARN

2016-01-12 Thread Arkadiusz Bicz
Hi, You can check out http://spark.apache.org/docs/latest/monitoring.html; you can monitor HDFS and memory usage per job, executor and driver. I have connected it to Graphite for storage and Grafana for visualization. I have also connected it to collectd, which provides all server node metrics
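
A sketch of conf/metrics.properties entries for Spark's Graphite sink (the host, port and prefix below are placeholders, not values from the thread):

    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds
    *.sink.graphite.prefix=spark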

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-11 Thread Arkadiusz Bicz
Hi, There is some documentation at https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html and you can also check out the tests in DatasetSuite in the Spark sources. BR, Arkadiusz Bicz On Mon, Jan 11, 2016 at 5:37 AM, Muthu Jayakumar <bablo...@gmail.

How HiveContext can read subdirectories

2016-01-07 Thread Arkadiusz Bicz
"create external table testsubdirectories (id string, value string) STORED AS PARQUET location '/tmp/df'") val hcall = hc.sql("select * from testsubdirectories") assert(hcall.count() == 6) // should return 6 but it is 0 because subdirectories are not read Thanks, Arkadiusz Bicz
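
Settings commonly suggested for reading subdirectories of an external Hive table (a sketch; the thread does not confirm them, and whether they take effect depends on the Spark and Hive versions in use):

    hc.setConf("hive.mapred.supports.subdirectories", "true")
    hc.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    val hcall = hc.sql("select * from testsubdirectories")
    hcall.count()   // should now include rows found under subdirectories, if the settings apply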

Spark DataFrame limit question

2016-01-06 Thread Arkadiusz Bicz
.parquet("/tmp/smallresults2") The same happens when I create an external table in the Hive context as the results table: hiveContext.sql("select * from results limit 1").write.parquet("/tmp/results/one3") Thanks, Arkadiusz Bicz

Re: Monitoring Spark HDFS Reads and Writes

2015-12-31 Thread Arkadiusz Bicz
Hello, Spark collects HDFS read/write metrics per application/job; see details at http://spark.apache.org/docs/latest/monitoring.html. I have connected Spark metrics to Graphite and then display nice graphs in Grafana. BR, Arek On Thu, Dec 31, 2015 at 2:00 PM, Steve Loughran