Re: HBase connector does not read ZK configuration from Spark session

2018-02-22 Thread Jorge Machado
Can it be that you are missing the HBASE_HOME var? Jorge Machado > On 23 Feb 2018, at 04:55, Dharmin Siddesh J wrote: > > I am trying to write a Spark program that reads data from HBase and stores it > in a DataFrame. > > I am able to run it perfectly with

Re: Consuming Data in Parallel using Spark Streaming

2018-02-22 Thread naresh Goud
Here is my understanding; hope this gives some idea of how it works. It might also be wrong, so please excuse me if it is. I am trying to derive the execution model from my understanding. Sorry, it's a long email. The driver will keep polling Kafka for the latest offset of each topic, and then it
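
A minimal sketch of the setup being described, assuming the spark-streaming-kafka-0-10 integration and made-up broker/topic names; each batch interval the driver asks Kafka for the latest offsets per topic and hands the resulting offset ranges to the executors:

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka010._

  val conf = new SparkConf().setAppName("MultiTopicStream")
  val ssc = new StreamingContext(conf, Seconds(30))

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "broker:9092",               // hypothetical broker
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "example-group",
    "auto.offset.reset" -> "latest"
  )

  // One direct stream over several topics: on every batch the driver resolves
  // the latest offsets and builds per-partition offset ranges for the executors.
  val topics = Seq("topicA", "topicB")
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
  )

  stream.map(record => (record.topic, record.value)).count().print()

  ssc.start()
  ssc.awaitTermination()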

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread naresh Goud
Got it. I had understood the issue in a different way. On Thu, Feb 22, 2018 at 9:19 PM Keith Chapman wrote: > My issue is that there is not enough pressure on GC, hence GC is not > kicking in fast enough to delete the shuffle files of previous iterations. > > Regards, > Keith.

HBase connector does not read ZK configuration from Spark session

2018-02-22 Thread Dharmin Siddesh J
I am trying to write a Spark program that reads data from HBase and stores it in a DataFrame. I am able to run it perfectly with hbase-site.xml in the $SPARK_HOME/conf folder, but I am facing a few issues here. Issue 1: The first issue is passing the hbase-site.xml location with the --files parameter

RE: Consuming Data in Parallel using Spark Streaming

2018-02-22 Thread Vibhakar, Beejal
Naresh – Thanks for taking the time to respond. So is it right to say that it's the driver program which, every 30 seconds, tells the executors (which manage the streams) to run, rather than each executor making that decision itself? And this really makes it sequential execution in my
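
For reference, a hedged sketch of one knob relevant to this question: by default the streaming JobScheduler runs the jobs of a batch one after another, but the (undocumented) spark.streaming.concurrentJobs setting lets independent output operations in the same batch be scheduled in parallel. The value shown is only an example:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("ParallelTopicConsumer")
    // Default is 1, i.e. one streaming job at a time per batch.
    .set("spark.streaming.concurrentJobs", "4")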

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread Keith Chapman
My issue is that there is not enough pressure on GC, hence GC is not kicking in fast enough to delete the shuffle files of previous iterations. Regards, Keith. http://keith-chapman.com On Thu, Feb 22, 2018 at 6:58 PM, naresh Goud wrote: > It would be very difficult

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread naresh Goud
It would be very difficult to tell without knowing what your application code is doing and what kind of transformations/actions it is performing. From my previous experience, tuning the application code to avoid unnecessary objects reduces pressure on the GC. On Thu, Feb 22, 2018 at 2:13 AM, Keith Chapman
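
To illustrate the "avoid unnecessary objects" advice, a small sketch (assuming rdd is an RDD[java.util.Date]) that builds one heavyweight helper per partition with mapPartitions instead of one per record:

  import java.text.SimpleDateFormat

  // Per-record version: a new formatter for every element adds GC pressure.
  // rdd.map(d => new SimpleDateFormat("yyyy-MM-dd").format(d))

  // Per-partition version: one formatter reused for every record in the partition.
  val formatted = rdd.mapPartitions { iter =>
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    iter.map(d => fmt.format(d))
  }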

Re: Return statements aren't allowed in Spark closures

2018-02-22 Thread naresh Goud
Even I am not able to reproduce the error. On Thu, Feb 22, 2018 at 2:51 AM, Michael Artz wrote: > I am not able to reproduce your error. You should do something before you > do that last function and maybe get some more help from the exception it > returns. Like just add a

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-22 Thread kant kodali
Hi TD, I pulled your commit that is listed on this ticket: https://issues.apache.org/jira/browse/SPARK-23406. Specifically, I did the following steps, and self joins work after I cherry-picked your commit! Good job! I was hoping it would be part of 2.3.0, but it looks like it is targeted for 2.3.1 :( git

Re: Can spark handle this scenario?

2018-02-22 Thread Lian Jiang
Hi Vijay, Should HTTPConnection() (or any other object created per partition) be serializable so that your code works? If so, the usage seems to be limited. Sometimes, the error caused by a non-serializable object can be very misleading (e.g. "Return statements aren't allowed in Spark closures")
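
A sketch of the pattern under discussion (endpoint and names are illustrative): when the connection is created inside foreachPartition on the executor, it never leaves the JVM it was built in, so it does not have to be serializable; only the values captured by the closure do:

  import java.net.{HttpURLConnection, URL}

  rdd.foreachPartition { records =>
    // Constructed on the executor, once per partition; never shipped from the driver.
    val conn = new URL("http://example.com/ingest")
      .openConnection().asInstanceOf[HttpURLConnection]
    try {
      records.foreach { r =>
        // post each record using conn ...
      }
    } finally {
      conn.disconnect()
    }
  }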

Re: parquet vs orc files

2018-02-22 Thread Jörn Franke
Look at the documentation of the formats. In any case: * additionally use partitions on the filesystem * sort the data on the filter columns - otherwise you do not benefit from min/max and bloom filters > On 21 Feb 2018, at 22:58, Kane Kim wrote: > > Thanks, how does
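
A hedged sketch of that layout advice (column names are invented): partition the output on a coarse column and sort within the files on the columns you filter by, so that Parquet/ORC min/max statistics and bloom filters can actually skip row groups:

  import org.apache.spark.sql.functions.col

  df.repartition(col("event_date"))            // group rows by partition value
    .sortWithinPartitions(col("customer_id"))  // sorted data -> tight min/max per row group
    .write
    .partitionBy("event_date")                 // directory-level partition pruning
    .parquet("/data/events_parquet")           // hypothetical path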

Re: parquet vs orc files

2018-02-22 Thread Kurt Fehlhauer
Hi Kane, It really depends on your use case. I generally use Parquet because it seems to have better support beyond Spark. However, if you are dealing with partitioned Hive tables, the current versions of Spark have an issue where compression will not be applied. This will be fixed in version
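
For completeness, a sketch of making the codec explicit instead of relying on defaults; whether this sidesteps the partitioned-Hive-table issue mentioned above depends on the Spark version:

  // Session-wide default codec for Parquet output.
  spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

  // Or per write, which applies to this writer only.
  df.write
    .option("compression", "snappy")
    .partitionBy("event_date")        // illustrative partition column
    .parquet("/data/events_parquet")  // hypothetical path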

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-22 Thread Tathagata Das
Hey, Thanks for testing out stream-stream joins and reporting this issue. I am going to take a look at this. TD On Tue, Feb 20, 2018 at 8:20 PM, kant kodali wrote: > if I change it to the below code it works. However, I don't believe it is > the solution I am looking
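
For anyone following along, a minimal sketch of the kind of streaming self join being tested (the rate source and column names are only for illustration); this is the shape of query that SPARK-23406 reports as failing on an unpatched 2.3.0:

  import org.apache.spark.sql.functions.col

  val stream = spark.readStream.format("rate").load()  // columns: timestamp, value

  val left  = stream.select(col("value").as("key"), col("timestamp").as("leftTime"))
  val right = stream.select(col("value").as("key"), col("timestamp").as("rightTime"))

  // Inner stream-stream self join on the shared key.
  val joined = left.join(right, "key")

  val query = joined.writeStream.format("console").start()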

Re: Log analysis with GraphX

2018-02-22 Thread JB Data
A new one created with my basic *datayse.* @*JB*Δ 2018-02-21 13:14 GMT+01:00 Philippe de Rochambeau : > Hi JB, > which column in the 8-line DS do you regress? > > > On 21 Feb 2018 09:47, JB Data wrote: > > Hi, > > Interesting

Re: Return statements aren't allowed in Spark closures

2018-02-22 Thread Michael Artz
I am not able to reproduce your error. You should do something before you do that last function and maybe get some more help from the exception it returns. Like just add a csv.show(1) on the line before. Also, can you post the different exception from when you took out the "return" value, like when
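
To make the original error concrete, a hedged sketch (method names invented) of the pattern Spark's closure cleaner rejects, and the usual rewrite that lets the if/else expression be the closure's value instead of returning:

  import org.apache.spark.rdd.RDD

  // Rejected: `return` inside the closure is a non-local return to the enclosing
  // method, and Spark fails with "Return statements aren't allowed in Spark closures".
  def sumDoubledBad(rdd: RDD[Int]): Int =
    rdd.map { x =>
      if (x < 0) return 0   // triggers the error
      x * 2
    }.reduce(_ + _)

  // OK: map negative values to 0 without a non-local return.
  def sumDoubledGood(rdd: RDD[Int]): Int =
    rdd.map(x => if (x < 0) 0 else x * 2).reduce(_ + _)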

Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread Keith Chapman
Hi, I'm benchmarking a Spark application by running it for multiple iterations. It's a benchmark that's heavy on shuffle, and I run it on a local machine with a very large heap (~200GB). The system has an SSD. When running for 3 to 4 iterations I get into a situation where I run out of disk space on
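
One knob that is sometimes suggested for this situation (a hedged sketch, not a guaranteed fix): shuffle files are removed by the ContextCleaner only after the driver GCs the objects referencing them, so with a very large heap it can help to force that GC more often than the 30-minute default:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("ShuffleHeavyBenchmark")
    // Trigger a driver GC periodically so weakly referenced shuffle state
    // (and its on-disk files) gets released sooner.
    .config("spark.cleaner.periodicGC.interval", "5min")
    .getOrCreate()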

Hortonworks Spark-Hbase-Connector does not read zookeeper configurations from spark session config ??(Spark on Yarn)

2018-02-22 Thread Dharmin Siddesh J
Hi, I am trying to write Spark code that reads data from HBase and stores it in a DataFrame. I am able to run it perfectly with hbase-site.xml in the $SPARK_HOME/conf folder, but I am facing a few issues here. Issue 1: Passing the hbase-site.xml location with the --files parameter submitted through client
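
One hedged workaround sketch for issue 1, assuming hbase-site.xml was shipped with --files (whether the connector honors this configuration depends on the connector and version): resolve the distributed copy with SparkFiles and add it to the HBase configuration explicitly, or set the ZooKeeper details by hand:

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.spark.SparkFiles

  val hbaseConf = HBaseConfiguration.create()

  // Load the copy distributed via `--files hbase-site.xml`...
  hbaseConf.addResource(new Path(SparkFiles.get("hbase-site.xml")))

  // ...or set the ZooKeeper connection details directly instead.
  hbaseConf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")        // hypothetical hosts
  hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")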