Re: Problems with reading data from parquet files in a HDFS remotely

2016-01-08 Thread Henrik Baastrup
Hi Ewan,

Thank you for your answer.
I have already tried what you suggest.

If I use:
"hdfs://172.27.13.57:7077/user/hdfs/parquet-multi/BICC"
I get the AssertionError exception:
Exception in thread "main" java.lang.AssertionError: assertion
failed: No predefined schema found, and no Parquet data files or summary
files found under hdfs://172.27.13.57:7077/user/hdfs/parquet-multi/BICC.
Note: The IP address of my Spark Master is: 172.27.13.57

If I do literally as you suggest:
"hdfs:///user/hdfs/parquet-multi/BICC"
I get an IOException:
Exception in thread "main" java.io.IOException: Incomplete HDFS URI,
no host: hdfs:///user/hdfs/parquet-multi/BICC

To me it seems that the Spark library tries to resolve the URI locally, and
I suspect I am missing something in my configuration of the SparkContext,
but I do not know what.
Or could it be that I am using the wrong port in the hdfs:// URI above?
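If the problem is the port, the URI would presumably need to point at the HDFS
NameNode's RPC port rather than the Spark master port. A minimal sketch of what
I mean, assuming the NameNode also runs on 172.27.13.57 and listens on the
default port 8020 (both assumptions, to be checked against core-site.xml):

// Sketch only: 7077 is the Spark standalone master port, not an HDFS port.
// Point the reader at the NameNode instead (host and port are assumptions).
DataFrameReader reader = sqlContext.read();
DataFrame df = reader.parquet("hdfs://172.27.13.57:8020/user/hdfs/parquet-multi/BICC");

Presumably the host-less form "hdfs:///user/..." only works when fs.defaultFS is
known to the client's Hadoop configuration, which would explain the "Incomplete
HDFS URI, no host" error above.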

Henrik




On 07/01/2016 19:41, Ewan Leith wrote:
>
> Try the path
>
>
> "hdfs:///user/hdfs/parquet-multi/BICC"
> Thanks,
> Ewan
>
>
> -- Original message --
> From: Henrik Baastrup
> Date: Thu, 7 Jan 2016 17:54
> To: user@spark.apache.org
> Cc: Baastrup, Henrik
> Subject: Problems with reading data from parquet files in a HDFS remotely
>
>
> Hi All,
>
> I have a small Hadoop cluster where I have stored a lot of data in parquet 
> files. I have installed a Spark master service on one of the nodes and now 
> would like to query my parquet files from a Spark client. When I run the 
> following program from the spark-shell on the Spark Master node, everything works
> correctly:
>
> # val sqlCont = new org.apache.spark.sql.SQLContext(sc)
> # val reader = sqlCont.read
> # val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
> # dataFrame.registerTempTable("BICC")
> # val recSet = sqlCont.sql("SELECT 
> protocolCode,beginTime,endTime,called,calling FROM BICC WHERE 
> endTime>=14494218 AND endTime<=14494224 AND 
> calling='6287870642893' AND p_endtime=14494224")
> # recSet.show()  
>
> But when I run the Java program below, from my client, I get: 
>
> Exception in thread "main" java.lang.AssertionError: assertion failed: No 
> predefined schema found, and no Parquet data files or summary files found 
> under file:/user/hdfs/parquet-multi/BICC.
>
> The exception occurs at the line: DataFrame df = 
> reader.parquet("/user/hdfs/parquet-multi/BICC");
>
> On the Master node I can see the client connect when the SparkContext is 
> instantiated, as I get the following lines in the Spark log:
>
> 16/01/07 18:27:47 INFO Master: Registering app SparkTest
> 16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID 
> app-20160107182747-00801
>
> If I create a local directory with the given path, my program goes into an
> endless loop, with the following warning on the console:
>
> WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources
>
> To me it seems that my SQLContext does not connect to the Spark Master, but
> tries to work locally on the client, where the requested files do not exist.
>
> Java program:
>   SparkConf conf = new SparkConf()
>   .setAppName("SparkTest")
>   .setMaster("spark://172.27.13.57:7077");
>   JavaSparkContext sc = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sc);
>   
>   DataFrameReader reader = sqlContext.read();
>   DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
>   DataFrame filtered = df.filter("endTime>=14494218 AND 
> endTime<=14494224 AND calling='6287870642893' AND 
> p_endtime=14494224");
>   filtered.show();
>
> Is there someone who can help me?
>
> Henrik
>



Re: Problems with reading data from parquet files in a HDFS remotely

2016-01-08 Thread Henrik Baastrup
I solved the problem. I needed to tell the SparkContext about my Hadoop
setup, so now my program is as follows:

// Imports needed (Spark 1.x / Hadoop 2.x):
import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.SQLContext;

SparkConf conf = new SparkConf()
    .setAppName("SparkTest")
    .setMaster("spark://172.27.13.57:7077")
    .set("spark.executor.memory", "2g")   // We assign 2 GB RAM to our job on each Worker
    .set("spark.driver.port", "51810");   // Fix the port the driver will listen on, good for firewalls!
JavaSparkContext sc = new JavaSparkContext(conf);

// Tell Spark about our Hadoop environment
File coreSite = new File("/etc/hadoop/conf/core-site.xml");
File hdfsSite = new File("/etc/hadoop/conf/hdfs-site.xml");
Configuration hConf = sc.hadoopConfiguration();
hConf.addResource(new Path(coreSite.getAbsolutePath()));
hConf.addResource(new Path(hdfsSite.getAbsolutePath()));

SQLContext sqlContext = new SQLContext(sc);

DataFrameReader reader = sqlContext.read();
DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
DataFrame filtered = df.filter(
    "endTime>=14494218 AND endTime<=14494224 AND calling='6287870642893' AND p_endtime=14494224");
filtered.show();
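
For anyone who cannot copy the cluster's XML files to the client, a rough,
untested sketch of an alternative is to set fs.defaultFS on the Hadoop
configuration directly; "namenode-host" and port 8020 below are placeholders
for whatever core-site.xml declares:

// Untested sketch: point fs.defaultFS at the NameNode instead of adding the XML resources.
Configuration hConf = sc.hadoopConfiguration();
hConf.set("fs.defaultFS", "hdfs://namenode-host:8020");

With fs.defaultFS known to the client, a path like "/user/hdfs/parquet-multi/BICC"
should resolve against HDFS instead of the local file system, which is why my
original program fell back to file:/user/hdfs/parquet-multi/BICC.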

Henrik

On 07/01/2016 19:41, Ewan Leith wrote:
>
> Try the path
>
>
> "hdfs:///user/hdfs/parquet-multi/BICC"
> Thanks,
> Ewan
>
>
> -- Original message --
> From: Henrik Baastrup
> Date: Thu, 7 Jan 2016 17:54
> To: user@spark.apache.org
> Cc: Baastrup, Henrik
> Subject: Problems with reading data from parquet files in a HDFS remotely
>
>
> Hi All,
>
> I have a small Hadoop cluster where I have stored a lot of data in parquet 
> files. I have installed a Spark master service on one of the nodes and now 
> would like to query my parquet files from a Spark client. When I run the 
> following program from the spark-shell on the Spark Master node, everything works
> correctly:
>
> # val sqlCont = new org.apache.spark.sql.SQLContext(sc)
> # val reader = sqlCont.read
> # val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
> # dataFrame.registerTempTable("BICC")
> # val recSet = sqlCont.sql("SELECT 
> protocolCode,beginTime,endTime,called,calling FROM BICC WHERE 
> endTime>=14494218 AND endTime<=14494224 AND 
> calling='6287870642893' AND p_endtime=14494224")
> # recSet.show()  
>
> But when I run the Java program below, from my client, I get: 
>
> Exception in thread "main" java.lang.AssertionError: assertion failed: No 
> predefined schema found, and no Parquet data files or summary files found 
> under file:/user/hdfs/parquet-multi/BICC.
>
> The exception occurs at the line: DataFrame df = 
> reader.parquet("/user/hdfs/parquet-multi/BICC");
>
> On the Master node I can see the client connect when the SparkContext is 
> instantiated, as I get the following lines in the Spark log:
>
> 16/01/07 18:27:47 INFO Master: Registering app SparkTest
> 16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID 
> app-20160107182747-00801
>
> If I create a local directory with the given path, my program goes into an
> endless loop, with the following warning on the console:
>
> WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources
>
> To me it seems that my SQLContext does not connect to the Spark Master, but
> tries to work locally on the client, where the requested files do not exist.
>
> Java program:
>   SparkConf conf = new SparkConf()
>   .setAppName("SparkTest")
>   .setMaster("spark://172.27.13.57:7077");
>   JavaSparkContext sc = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sc);
>   
>   DataFrameReader reader = sqlContext.read();
>   DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
>   DataFrame filtered = df.filter("endTime>=14494218 AND 
> endTime<=14494224 AND calling='6287870642893' AND 
> p_endtime=14494224");
>   filtered.show();
>
> Is there someone who can help me?
>
> Henrik
>



Re: Problems with reading data from parquet files in a HDFS remotely

2016-01-07 Thread Prem Sure
You may need to add a createDataFrame() call (for Python, with inferSchema) before registerTempTable.
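
For example, in Java to match the program below, a rough, untested sketch of
supplying a schema explicitly before registering the table, reusing the sc and
sqlContext from your program; the column names and types here are only guesses
taken from the query, not the real schema of the Parquet files:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Guessed schema, for illustration only
StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("protocolCode", DataTypes.StringType, true),
    DataTypes.createStructField("beginTime", DataTypes.LongType, true),
    DataTypes.createStructField("endTime", DataTypes.LongType, true),
    DataTypes.createStructField("called", DataTypes.StringType, true),
    DataTypes.createStructField("calling", DataTypes.StringType, true)));

// One dummy row so the snippet runs end to end; real data would come from the files
JavaRDD<Row> rows = sc.parallelize(Arrays.asList(
    RowFactory.create("BICC", 14494218L, 14494224L, "123", "6287870642893")));

DataFrame df = sqlContext.createDataFrame(rows, schema);
df.registerTempTable("BICC");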

Thanks,

Prem


On Thu, Jan 7, 2016 at 12:53 PM, Henrik Baastrup <
henrik.baast...@netscout.com> wrote:

> Hi All,
>
> I have a small Hadoop cluster where I have stored a lot of data in parquet 
> files. I have installed a Spark master service on one of the nodes and now 
> would like to query my parquet files from a Spark client. When I run the 
> following program from the spark-shell on the Spark Master node, everything works
> correctly:
>
> # val sqlCont = new org.apache.spark.sql.SQLContext(sc)
> # val reader = sqlCont.read
> # val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
> # dataFrame.registerTempTable("BICC")
> # val recSet = sqlCont.sql("SELECT 
> protocolCode,beginTime,endTime,called,calling FROM BICC WHERE 
> endTime>=14494218 AND endTime<=14494224 AND 
> calling='6287870642893' AND p_endtime=14494224")
> # recSet.show()
>
> But when I run the Java program below, from my client, I get:
>
> Exception in thread "main" java.lang.AssertionError: assertion failed: No 
> predefined schema found, and no Parquet data files or summary files found 
> under file:/user/hdfs/parquet-multi/BICC.
>
> The exception occurs at the line: DataFrame df = 
> reader.parquet("/user/hdfs/parquet-multi/BICC");
>
> On the Master node I can see the client connect when the SparkContext is 
> instantiated, as I get the following lines in the Spark log:
>
> 16/01/07 18:27:47 INFO Master: Registering app SparkTest
> 16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID 
> app-20160107182747-00801
>
> If I create a local directory with the given path, my program goes into an
> endless loop, with the following warning on the console:
>
> WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources
>
> To me it seems that my SQLContext does not connect to the Spark Master, but
> tries to work locally on the client, where the requested files do not exist.
>
> Java program:
>   SparkConf conf = new SparkConf()
>   .setAppName("SparkTest")
>   .setMaster("spark://172.27.13.57:7077");
>   JavaSparkContext sc = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sc);
>   
>   DataFrameReader reader = sqlContext.read();
>   DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
>   DataFrame filtered = df.filter("endTime>=14494218 AND 
> endTime<=14494224 AND calling='6287870642893' AND 
> p_endtime=14494224");
>   filtered.show();
>
> Is there someone who can help me?
>
> Henrik
>
>
>


Problems with reading data from parquet files in a HDFS remotely

2016-01-07 Thread Henrik Baastrup
Hi All,

I have a small Hadoop cluster where I have stored a lot of data in parquet 
files. I have installed a Spark master service on one of the nodes and now 
would like to query my parquet files from a Spark client. When I run the 
following program from the spark-shell on the Spark Master node, everything works
correctly:

# val sqlCont = new org.apache.spark.sql.SQLContext(sc)
# val reader = sqlCont.read
# val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
# dataFrame.registerTempTable("BICC")
# val recSet = sqlCont.sql("SELECT protocolCode,beginTime,endTime,called,calling FROM BICC WHERE endTime>=14494218 AND endTime<=14494224 AND calling='6287870642893' AND p_endtime=14494224")
# recSet.show()  

But when I run the Java program below, from my client, I get: 

Exception in thread "main" java.lang.AssertionError: assertion failed: No 
predefined schema found, and no Parquet data files or summary files found under 
file:/user/hdfs/parquet-multi/BICC.

The exception occurs at the line: DataFrame df = 
reader.parquet("/user/hdfs/parquet-multi/BICC");

On the Master node I can see the client connect when the SparkContext is 
instantiated, as I get the following lines in the Spark log:

16/01/07 18:27:47 INFO Master: Registering app SparkTest
16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID 
app-20160107182747-00801

If I create a local directory with the given path, my program goes into an
endless loop, with the following warning on the console:

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; 
check your cluster UI to ensure that workers are registered and have sufficient 
resources

To me it seems that my SQLContext does not connect to the Spark Master, but tries
to work locally on the client, where the requested files do not exist.

Java program:
SparkConf conf = new SparkConf()
.setAppName("SparkTest")
.setMaster("spark://172.27.13.57:7077");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

DataFrameReader reader = sqlContext.read();
DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
DataFrame filtered = df.filter("endTime>=14494218 AND endTime<=14494224 AND calling='6287870642893' AND p_endtime=14494224");
filtered.show();

Is there someone who can help me?

Henrik