Error - Spark reading from HDFS via dataframes - Java

2017-09-30 Thread Kanagha Kumar
Hi, I'm trying to read data from HDFS in Spark as DataFrames. Printing the schema, I see that all columns are being read as strings. I'm converting the DataFrame to an RDD and creating another DataFrame by passing in the correct schema (how the rows should finally be interpreted). I'm getting the following

Re: HDP 2.5 - Python - Spark-On-Hbase

2017-09-30 Thread Debabrata Ghosh
Ayan, did you get the HBase connection working through PySpark as well? I have the Spark-HBase connection working with Scala (via HBaseContext). However, I eventually want to get this working within PySpark code. Would you have some suitable code snippets or

Re: NullPointerException error while saving Scala Dataframe to HBase

2017-09-30 Thread mailfordebu
Hi guys, I am not sure whether the email is reaching the community members. Could somebody please acknowledge? > On 30-Sep-2017, at 5:02 PM, Debabrata Ghosh wrote: > > Dear All, > Greetings! I am repeatedly hitting a

Re: Structured Streaming and Hive

2017-09-30 Thread Jacek Laskowski
Hi, I'm guessing it's a timing issue. Once you started the query, batch 0 either had no rows to save or had not started yet (it runs on a separate thread), so spark.sql ran once and saved nothing. You should rather use a foreach writer to save results to Hive. Jacek On 29 Sep 2017 11:36 am, "HanPan"

NullPointerException error while saving Scala Dataframe to HBase

2017-09-30 Thread Debabrata Ghosh
Dear All, Greetings! I am repeatedly hitting a NullPointerException while saving a Scala DataFrame to HBase. Could you please help me resolve this? Here is the code snippet: scala> def catalog = s"""{ ||"table":{"namespace":"default", "name":"table1"},

Re: HDFS or NFS as a cache?

2017-09-30 Thread Steve Loughran
On 29 Sep 2017, at 20:03, JG Perrin wrote: You will collect in the driver (often the master) and it will save the data, so for saving, you will not have to set up HDFS. No, it doesn't work quite like that. 1. workers generate their data and

Re: HDFS or NFS as a cache?

2017-09-30 Thread Steve Loughran
On 29 Sep 2017, at 15:59, Alexander Czech wrote: Yes, I have identified the rename as the problem; that is why I think the extra bandwidth of the larger instances might not help. Also there is a consistency issue with S3

py4j.protocol.Py4JNetworkError: Error while receiving Socket.timeout: timed out

2017-09-30 Thread Krishnaprasad
Hi all, I am developing an application that runs on Apache Spark (set up on a single node), and as part of the implementation I am using PySpark version 2.2.0. Environment: the OS is Ubuntu 14.04 and the Python version is 3.4. I am getting the error shown below. It would be helpful if