Re: Problems with reading data from parquet files in a HDFS remotely
Hi Ewan,

Thank you for your answer. I have already tried what you suggest. If I use:

"hdfs://172.27.13.57:7077/user/hdfs/parquet-multi/BICC"

I get an AssertionError:

Exception in thread "main" java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under hdfs://172.27.13.57:7077/user/hdfs/parquet-multi/BICC.

Note: the IP address of my Spark Master is 172.27.13.57.

If I do literally as you suggest:

"hdfs:///user/hdfs/parquet-multi/BICC"

I get an IOException:

Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: hdfs:///user/hdfs/parquet-multi/BICC

To me it seems that the Spark library tries to resolve the URI locally, and I suspect I am missing something in my configuration of the SparkContext, but I do not know what. Or could it be that I use the wrong port in the hdfs:// URI above?

Henrik

On 07/01/2016 19:41, Ewan Leith wrote:
> Try the path "hdfs:///user/hdfs/parquet-multi/BICC"
>
> Thanks,
> Ewan
>
> -- Original message --
> From: Henrik Baastrup
> Date: Thu, 7 Jan 2016 17:54
> To: user@spark.apache.org
> Cc: Baastrup, Henrik
> Subject: Problems with reading data from parquet files in a HDFS remotely
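Henrik's closing question about the port is likely the key one: 7077 is the Spark standalone master's port, not the HDFS NameNode's. The NameNode RPC port is whatever fs.defaultFS says in core-site.xml; 8020 is a common default (9000 is also seen). A minimal sketch of the fully-qualified URI shape, where the port 8020 is an assumption to be checked against core-site.xml:

```java
import java.net.URI;

public class HdfsUriCheck {
    public static void main(String[] args) {
        // 7077 is the Spark standalone master's port, not the HDFS NameNode's.
        // The NameNode port below (8020) is an assumed common default, not a
        // value confirmed in this thread; check fs.defaultFS in core-site.xml.
        URI uri = URI.create("hdfs://172.27.13.57:8020/user/hdfs/parquet-multi/BICC");
        System.out.println(uri.getScheme()); // hdfs
        System.out.println(uri.getHost());   // 172.27.13.57
        System.out.println(uri.getPort());   // 8020
        System.out.println(uri.getPath());   // /user/hdfs/parquet-multi/BICC
    }
}
```

If the host and port in the URI do not match the NameNode's actual address, the Parquet reader will either find nothing or fail to connect, which is consistent with the AssertionError above.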
Re: Problems with reading data from parquet files in a HDFS remotely
I solved the problem. I needed to tell the SparkContext about my Hadoop set-up, so now my program is as follows:

SparkConf conf = new SparkConf()
    .setAppName("SparkTest")
    .setMaster("spark://172.27.13.57:7077")
    .set("spark.executor.memory", "2g")  // We assign 2 GB of RAM to our job on each Worker
    .set("spark.driver.port", "51810");  // Fix the port the driver will listen on, good for firewalls!
JavaSparkContext sc = new JavaSparkContext(conf);

// Tell Spark about our Hadoop environment
File coreSite = new File("/etc/hadoop/conf/core-site.xml");
File hdfsSite = new File("/etc/hadoop/conf/hdfs-site.xml");
Configuration hConf = sc.hadoopConfiguration();
hConf.addResource(new Path(coreSite.getAbsolutePath()));
hConf.addResource(new Path(hdfsSite.getAbsolutePath()));

SQLContext sqlContext = new SQLContext(sc);
DataFrameReader reader = sqlContext.read();
DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
DataFrame filtered = df.filter("endTime>=14494218 AND endTime<=14494224 AND calling='6287870642893' AND p_endtime=14494224");
filtered.show();

Henrik

On 07/01/2016 19:41, Ewan Leith wrote:
> Try the path "hdfs:///user/hdfs/parquet-multi/BICC"
>
> Thanks,
> Ewan
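The two files passed to addResource are plain Hadoop XML property lists; the setting that makes a scheme-less path like "/user/hdfs/parquet-multi/BICC" resolve against HDFS instead of the local disk is fs.defaultFS (fs.default.name in older Hadoop versions). A stand-alone sketch of the lookup, using only the JDK so it runs without Hadoop on the classpath; the temp file and its hdfs:// value are illustrative assumptions, not Henrik's actual config:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.nio.file.Files;
import java.nio.file.Path;

public class CoreSitePeek {
    // Hadoop config files are flat <property><name>/<value> lists; this mirrors
    // the lookup that Configuration.addResource() ultimately makes possible.
    static String lookup(Path xml, String wanted) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xml.toFile());
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent();
            if (name.equals(wanted))
                return p.getElementsByTagName("value").item(0).getTextContent();
        }
        return null;  // property not defined in this file
    }

    public static void main(String[] args) throws Exception {
        // A minimal core-site.xml as it might look on such a cluster
        // (host and port here are assumptions, not values from the thread):
        Path tmp = Files.createTempFile("core-site", ".xml");
        Files.writeString(tmp,
            "<configuration><property>"
          + "<name>fs.defaultFS</name>"
          + "<value>hdfs://172.27.13.57:8020</value>"
          + "</property></configuration>");
        System.out.println(lookup(tmp, "fs.defaultFS")); // hdfs://172.27.13.57:8020
    }
}
```

Once fs.defaultFS is loaded, reader.parquet("/user/hdfs/parquet-multi/BICC") no longer falls back to file:/ on the client, which is why Henrik's fix works.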
Re: Problems with reading data from parquet files in a HDFS remotely
You may need to add a createDataFrame call (in Python, with inferSchema) before registerTempTable.

Thanks,
Prem

On Thu, Jan 7, 2016 at 12:53 PM, Henrik Baastrup <henrik.baast...@netscout.com> wrote:
> Hi All,
>
> I have a small Hadoop cluster where I have stored a lot of data in parquet
> files. I have installed a Spark master service on one of the nodes and now
> would like to query my parquet files from a Spark client.
Problems with reading data from parquet files in a HDFS remotely
Hi All,

I have a small Hadoop cluster where I have stored a lot of data in Parquet files. I have installed a Spark master service on one of the nodes and would now like to query my Parquet files from a Spark client. When I run the following program from the spark-shell on the Spark Master node, everything works correctly:

val sqlCont = new org.apache.spark.sql.SQLContext(sc)
val reader = sqlCont.read
val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
dataFrame.registerTempTable("BICC")
val recSet = sqlCont.sql("SELECT protocolCode,beginTime,endTime,called,calling FROM BICC WHERE endTime>=14494218 AND endTime<=14494224 AND calling='6287870642893' AND p_endtime=14494224")
recSet.show()

But when I run the Java program below from my client, I get:

Exception in thread "main" java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/user/hdfs/parquet-multi/BICC.

The exception occurs at the line:

DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");

On the Master node I can see the client connect when the SparkContext is instantiated, as I get the following lines in the Spark log:

16/01/07 18:27:47 INFO Master: Registering app SparkTest
16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID app-20160107182747-00801

If I create a local directory with the given path, my program goes into an endless loop, with the following warning on the console:

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

To me it seems that my SQLContext does not connect to the Spark Master, but tries to work locally on the client, where the requested files do not exist.

Java program:

SparkConf conf = new SparkConf()
    .setAppName("SparkTest")
    .setMaster("spark://172.27.13.57:7077");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

DataFrameReader reader = sqlContext.read();
DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
DataFrame filtered = df.filter("endTime>=14494218 AND endTime<=14494224 AND calling='6287870642893' AND p_endtime=14494224");
filtered.show();

Is there someone who can help me?

Henrik
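The "file:/user/hdfs/parquet-multi/BICC" in the error is the telltale detail: a path with no scheme is resolved against the client's default filesystem, and with no Hadoop configuration loaded that default is the local file:// filesystem, so Spark looks for the Parquet files on the client's own disk. A sketch of the idea using java.net.URI resolution; "namenode:8020" is an illustrative assumption, not a value from this thread:

```java
import java.net.URI;

public class DefaultFsResolution {
    public static void main(String[] args) {
        // A scheme-less path is resolved against whatever the default
        // filesystem is. Unconfigured clients default to the local one.
        URI localDefault = URI.create("file:///");
        URI hdfsDefault  = URI.create("hdfs://namenode:8020/");

        // What the unconfigured client does (local filesystem):
        System.out.println(localDefault.resolve("/user/hdfs/parquet-multi/BICC"));
        // What was intended (HDFS):
        System.out.println(hdfsDefault.resolve("/user/hdfs/parquet-multi/BICC"));
        // -> hdfs://namenode:8020/user/hdfs/parquet-multi/BICC
    }
}
```

This matches the thread's eventual resolution: once core-site.xml and hdfs-site.xml are loaded into the SparkContext's Hadoop configuration, the same scheme-less path resolves against HDFS.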