Re: More instances = slower Spark job

2017-10-01 Thread Steve Loughran
On 28 Sep 2017, at 15:27, Daniel Siegmann wrote: Can you kindly explain how Spark uses parallelism for a bigger (say 1 GB) text file? Does it use InputFormat to create multiple splits and create one partition per split?

Re: More instances = slower Spark job

2017-10-01 Thread Steve Loughran
On 28 Sep 2017, at 14:45, ayan guha wrote: Hi, can you kindly explain how Spark uses parallelism for a bigger (say 1 GB) text file? Does it use InputFormat to create multiple splits and create one partition per split? Yes, input formats give you their splits,
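Steve's point can be sketched as follows. This is an illustrative example, not code from the thread: the paths are placeholders, and the partition counts depend on the cluster's split size (128 MB by default on HDFS).

```scala
// Sketch: a text file's input splits become RDD partitions.
import org.apache.spark.sql.SparkSession

object SplitsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("splits-demo")
      .getOrCreate()

    // With the default 128 MB split size, a 1 GB file yields
    // roughly 8 splits, hence roughly 8 partitions.
    val rdd = spark.sparkContext.textFile("hdfs:///data/big.txt")
    println(s"partitions = ${rdd.getNumPartitions}")

    // minPartitions asks the InputFormat for at least this many splits.
    val finer = spark.sparkContext.textFile("hdfs:///data/big.txt", minPartitions = 32)
    println(s"partitions = ${finer.getNumPartitions}")

    spark.stop()
  }
}
```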

Re: More instances = slower Spark job

2017-10-01 Thread Jeroen Miller
Vadim's "scheduling within an application" approach turned out to be excellent, at least on a single node, with CPU usage reaching about 90%. I directly implemented the code template that Vadim kindly provided: parallel_collection_paths.foreach( path => { val lines =
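The truncated template above follows the "scheduling within an application" pattern: submit independent jobs from multiple driver threads so the scheduler can interleave them. A minimal sketch of that pattern, with illustrative paths and filter (not the code from the thread), might look like:

```scala
// Hedged sketch: run one Spark job per input path concurrently
// via a parallel collection of paths on the driver.
// (.par is available directly on Scala 2.12 and earlier;
//  Scala 2.13 needs the scala-parallel-collections module.)
import org.apache.spark.sql.SparkSession

object ParallelJobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-jobs").getOrCreate()
    val paths = Seq("s3://bucket/day=01", "s3://bucket/day=02", "s3://bucket/day=03")

    // Each foreach body submits its own Spark job; the parallel
    // collection runs them from separate driver threads, so jobs overlap.
    paths.par.foreach { path =>
      val lines = spark.sparkContext.textFile(path)
      val kept  = lines.filter(_.contains("\"event\":\"click\""))
      kept.saveAsTextFile(path + ".filtered")
    }
    spark.stop()
  }
}
```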

Re: More instances = slower Spark job

2017-10-01 Thread Gourav Sengupta
Hi Jeroen, I am not sure I agree that you would be spending more time and memory that way. But if that were the case, why are you not using DataFrames and a UDF? Regards, Gourav On Sun, Oct 1, 2017 at 6:17 PM, Jeroen Miller wrote: >

Re: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance

2017-10-01 Thread Gerard Maas
Hammad, The recommended way to implement this logic is to: create a SparkSession; create a StreamingContext using the SparkContext embedded in that SparkSession; and use the single SparkSession instance for the SQL operations within foreachRDD. It's important to note that Spark operations
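The structure Gerard recommends can be sketched as below. This is an illustrative skeleton, not code from the thread; the socket source, view name, and query are placeholders:

```scala
// Sketch: one SparkSession, a StreamingContext built on its
// SparkContext, and the same session reused for SQL in foreachRDD.
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SingleSessionStreaming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("TransformerStreamPOC")
      .getOrCreate()

    // Reuse the session's SparkContext instead of creating a second one.
    val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { rdd =>
      import spark.implicits._
      // The single shared session does the SQL work on each batch.
      val df = rdd.toDF("value")
      df.createOrReplaceTempView("batch")
      spark.sql("SELECT count(*) FROM batch").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```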

Re: NullPointerException error while saving Scala Dataframe to HBase

2017-10-01 Thread Marco Mistroni
Hi, the question is getting to the list. I have no experience in HBase... though, having seen similar stuff when saving a DataFrame somewhere else, it might have to do with the properties you need to set to let Spark know it is dealing with HBase? Don't you need to set some properties on the Spark

Fwd: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance

2017-10-01 Thread Hammad
Hello, *Background:* I have a Spark Streaming context: SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("TransformerStreamPOC"); conf.set("spark.driver.allowMultipleContexts", "true"); *<== this* JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));

Re: More instances = slower Spark job

2017-10-01 Thread Jeroen Miller
On Fri, Sep 29, 2017 at 12:20 AM, Gourav Sengupta wrote: > Why are you not using the JSON reader of Spark? Since the filter I want to perform is so simple, I do not want to spend time and memory to deserialise the JSON lines. Jeroen
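The trade-off Jeroen describes can be sketched as follows. The field name, values, and paths are illustrative, not from the thread:

```scala
// Sketch: spark.read.json parses every record and infers a schema,
// while filtering the raw lines as text applies a cheap substring
// predicate without deserialising the JSON at all.
import org.apache.spark.sql.SparkSession

object JsonFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("json-filter").getOrCreate()

    // Full deserialisation: each line is parsed into a DataFrame row.
    val parsed = spark.read.json("s3://bucket/events/*.json")
      .filter("status = 'error'")

    // Cheap alternative for a simple predicate: match on the raw text.
    val raw = spark.sparkContext.textFile("s3://bucket/events/*.json")
      .filter(_.contains("\"status\":\"error\""))

    println(parsed.count())
    println(raw.count())
    spark.stop()
  }
}
```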

How to find the temporary views' DDL

2017-10-01 Thread Sun, Keith
Hello, is there a way to find the DDL of a "temporary" view created in the current session with Spark SQL? For example: create or replace temporary view tmp_v as select c1 from table_x; "SHOW CREATE TABLE" does not work in this case as it is not a table. "DESCRIBE" could show the
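The behaviour described can be reproduced with the sketch below. Table and view names mirror the example in the question; the rest is illustrative. Note that while the catalog confirms a temporary view exists, it does not return the view's defining SQL:

```scala
// Sketch: DESCRIBE and the catalog see a temporary view,
// but SHOW CREATE TABLE rejects it because it is not a table.
import org.apache.spark.sql.SparkSession

object TempViewDdl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("tmp-view").getOrCreate()

    spark.range(10).toDF("c1").createOrReplaceTempView("table_x")
    spark.sql("CREATE OR REPLACE TEMPORARY VIEW tmp_v AS SELECT c1 FROM table_x")

    // Lists the session's temporary views (isTemporary = true).
    spark.catalog.listTables().show()

    // Shows the columns, but not the defining query.
    spark.sql("DESCRIBE tmp_v").show()

    spark.stop()
  }
}
```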

Re: Error - Spark reading from HDFS via dataframes - Java

2017-10-01 Thread Anastasios Zouzias
Hi, set the inferSchema option to true in spark-csv. You may also want to set the mode option. See the README below: https://github.com/databricks/spark-csv/blob/master/README.md Best, Anastasios On 01.10.2017 at 07:58, "Kanagha Kumar" wrote: Hi, I'm trying to read data
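The two options Anastasios mentions can be sketched as below. On Spark 2.x the CSV reader is built in; on Spark 1.x the same options go through the `com.databricks.spark.csv` format. The path and mode value are placeholders:

```scala
// Sketch: schema inference plus a parse mode when reading CSV.
import org.apache.spark.sql.SparkSession

object CsvRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("csv-read").getOrCreate()

    val df = spark.read
      .option("header", "true")        // first line holds column names
      .option("inferSchema", "true")   // sample the file to type the columns
      .option("mode", "DROPMALFORMED") // drop rows that fail to parse
      .csv("hdfs:///data/input.csv")

    df.printSchema()
    spark.stop()
  }
}
```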