Re: Problems with reading data from parquet files in a HDFS remotely

Prem Sure Thu, 07 Jan 2016 19:24:27 -0800

You may need to add a createDataFrame (for Python, inferSchema) call before registerTempTable.


Thanks,

Prem


On Thu, Jan 7, 2016 at 12:53 PM, Henrik Baastrup <
henrik.baast...@netscout.com> wrote:

> Hi All,
>
> I have a small Hadoop cluster where I have stored a lot of data in parquet 
> files. I have installed a Spark master service on one of the nodes and now 
> would like to query my parquet files from a Spark client. When I run the 
> following program from the spark-shell on the Spark Master node, everything
> works correctly:
>
> # val sqlCont = new org.apache.spark.sql.SQLContext(sc)
> # val reader = sqlCont.read
> # val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
> # dataFrame.registerTempTable("BICC")
> # val recSet = sqlCont.sql("SELECT protocolCode,beginTime,endTime,called,calling " +
> #     "FROM BICC WHERE endTime>=1449421800000000 AND endTime<=1449422400000000 " +
> #     "AND calling='6287870642893' AND p_endtime=1449422400000000")
> # recSet.show()
>
> But when I run the Java program below, from my client, I get:
>
> Exception in thread "main" java.lang.AssertionError: assertion failed: No 
> predefined schema found, and no Parquet data files or summary files found 
> under file:/user/hdfs/parquet-multi/BICC.
>
> The exception occurs at the line: DataFrame df = 
> reader.parquet("/user/hdfs/parquet-multi/BICC");
>
> On the Master node I can see the client connect when the SparkContext is
> instantiated, as I get the following lines in the Spark log:
>
> 16/01/07 18:27:47 INFO Master: Registering app SparkTest
> 16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID 
> app-20160107182747-00801
>
> If I create a local directory with the given path, my program goes into an
> endless loop, with the following warning on the console:
>
> WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources
>
> To me it seems that my SQLContext does not connect to the Spark Master, but
> tries to work locally on the client, where the requested files do not exist.
>
> Java program:
>       SparkConf conf = new SparkConf()
>               .setAppName("SparkTest")
>               .setMaster("spark://172.27.13.57:7077");
>       JavaSparkContext sc = new JavaSparkContext(conf);
>       SQLContext sqlContext = new SQLContext(sc);
>       
>       DataFrameReader reader = sqlContext.read();
>       DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
>       DataFrame filtered = df.filter("endTime>=1449421800000000 AND "
>               + "endTime<=1449422400000000 AND calling='6287870642893' AND "
>               + "p_endtime=1449422400000000");
>       filtered.show();
>
> Is there someone who can help me?
>
> Henrik
>
>
>
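
The exception in the quoted message reports the path as file:/user/hdfs/parquet-multi/BICC, which suggests that the scheme-less path was resolved against the client's local filesystem rather than HDFS. Hadoop qualifies a path that has no scheme against the configured default filesystem (fs.defaultFS), and that setting falls back to file:/// when the client has no cluster configuration. The sketch below imitates that qualification step in plain Java to show where the "file:" prefix could come from. It is a simplified illustration, not Hadoop's or Spark's actual code; the class PathQualifier, the method qualify, and the host namenode.example:8020 are hypothetical names that do not appear in the original messages.

```java
import java.net.URI;

public class PathQualifier {

    // Simplified illustration (assumption: mirrors only the basic idea of
    // resolving a scheme-less path against a default filesystem URI).
    static String qualify(String defaultFs, String path) {
        URI p = URI.create(path);
        if (p.getScheme() != null) {
            // The path already names its filesystem, so it is left untouched.
            return path;
        }
        URI base = URI.create(defaultFs);
        String authority = base.getAuthority();
        return base.getScheme() + ":"
                + (authority == null ? "" : "//" + authority)
                + p.getPath();
    }

    public static void main(String[] args) {
        String path = "/user/hdfs/parquet-multi/BICC";

        // No cluster configuration on the client: default filesystem is local.
        System.out.println(qualify("file:///", path));
        // prints file:/user/hdfs/parquet-multi/BICC

        // Hypothetical NameNode address used as the default filesystem.
        System.out.println(qualify("hdfs://namenode.example:8020", path));
        // prints hdfs://namenode.example:8020/user/hdfs/parquet-multi/BICC

        // A fully qualified path is independent of the default filesystem.
        System.out.println(qualify("file:///", "hdfs://namenode.example:8020" + path));
        // prints hdfs://namenode.example:8020/user/hdfs/parquet-multi/BICC
    }
}
```

If this is what happens on the client, passing a fully qualified hdfs:// URI to reader.parquet(...), or making the cluster's Hadoop configuration visible to the client, may be worth trying. The actual NameNode host and port have to come from the cluster's own configuration.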
