Looks like this data was encoded with an old version of Spark SQL. You'll need to set the flag to interpret binary data as a string. More info on configuration can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
sqlContext.sql("set spark.sql.parquet.binaryAsString=true")

Michael

On Fri, Oct 17, 2014 at 6:32 AM, neeraj <neeraj_gar...@infosys.com> wrote:
> Hi,
>
> When I run the given Spark SQL commands in the exercise, they return
> unexpected results. I'm explaining the results below for quick reference:
>
> 1. The output of the query wikiData.count() shows some count in the file.
>
> 2. After running the following query:
>
> sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE
> username <> '' GROUP BY username ORDER BY cnt DESC LIMIT
> 10").collect().foreach(println)
>
> I get output like the lines below (only the last couple of lines of this
> output are shown here). It doesn't show the actual results of the query.
> I tried increasing the driver memory as suggested in the exercise, but it
> doesn't help; the output is almost the same.
>
> 14/10/17 15:29:39 INFO executor.Executor: Finished task 199.0 in stage 2.0
> (TID 401). 2170 bytes result sent to driver
> 14/10/17 15:29:39 INFO executor.Executor: Finished task 198.0 in stage 2.0
> (TID 400). 2170 bytes result sent to driver
> 14/10/17 15:29:39 INFO scheduler.TaskSetManager: Finished task 198.0 in
> stage 2.0 (TID 400) in 13 ms on localhost (199/200)
> 14/10/17 15:29:39 INFO scheduler.TaskSetManager: Finished task 199.0 in
> stage 2.0 (TID 401) in 10 ms on localhost (200/200)
> 14/10/17 15:29:39 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0,
> whose tasks have all completed, from pool
> 14/10/17 15:29:39 INFO scheduler.DAGScheduler: Stage 2 (takeOrdered at
> basicOperators.scala:171) finished in 1.296 s
> 14/10/17 15:29:39 INFO spark.SparkContext: Job finished: takeOrdered at
> basicOperators.scala:171, took 3.150021634 s
>
> 3. I tried some other Spark SQL commands, as below:
>
> sqlContext.sql("SELECT username FROM wikiData LIMIT
> 10").collect().foreach(println)
>
> The output is:
> [[B@787cf559]
> [[B@53cfe3db]
> [[B@757869d9]
> [[B@346d61cf]
> [[B@793077ec]
> [[B@5d11651c]
> [[B@21054100]
> [[B@5fee77ef]
> [[B@21041d1d]
> [[B@15136bda]
>
> sqlContext.sql("SELECT * FROM wikiData LIMIT
> 10").collect().foreach(println)
>
> The output is:
> [12140913,[B@1d74e696,1394582048,[B@65ce90f5,[B@5c8ef90a]
> [12154508,[B@2e802eff,1393177457,[B@618d7f32,[B@1099dda7]
> [12165267,[B@65a70774,1398418319,[B@38da84cf,[B@12454f32]
> [12184073,[B@45264fd,1395243737,[B@3d642042,[B@7881ec8a]
> [12194348,[B@19d095d5,1372914018,[B@4d1ce030,[B@22c296dd]
> [12212394,[B@153e98ff,1389794332,[B@40ae983e,[B@68d2f9f]
> [12224899,[B@1f317315,1396830262,[B@677a77b2,[B@19487c31]
> [12240745,[B@65d181ee,1389890826,[B@1da9647b,[B@5c03d673]
> [12258034,[B@7ff44736,1385050943,[B@7e6f6bda,[B@4511f60f]
> [12279301,[B@1e317636,1382277991,[B@4147e2b6,[B@56753c35]
>
> I'm sure the above output of the queries is not the correct content of the
> parquet file. I'm not able to read the content of the parquet file
> directly.
>
> How do I validate the output of these queries against the actual content
> of the parquet file? What is the workaround for this issue? How do I read
> the file through Spark SQL? Is there a need to change the queries, and if
> so, what changes would give the exact result?
>
> Regards,
> Neeraj
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Help-required-on-exercise-Data-Exploratin-using-Spark-SQL-tp16569p16673.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
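For context on the strange values: the `[B@787cf559`-style output is not corrupted data. `[B@<hex>` is simply the default JVM toString of a `byte[]` (class name `[B` plus the identity hash code), which is what you get when a column is read as raw binary instead of a string. A minimal plain-Scala illustration (no Spark needed; the sample string is made up):

```scala
// A byte array printed directly uses Object.toString: "[B@" + hash code.
val bytes: Array[Byte] = "neeraj".getBytes("UTF-8")
println(bytes.toString)              // something like [B@5c8ef90a -- not the content
println(new String(bytes, "UTF-8"))  // decodes the same bytes back to "neeraj"
```

This is why the rows come back as `[B@...` references: the Parquet binary columns are surfacing as `byte[]` values, and setting `spark.sql.parquet.binaryAsString=true` tells Spark SQL to do the string decoding for you.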
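Putting Michael's fix together, a spark-shell sketch of the full sequence (Spark 1.x API as used in this thread; the Parquet path `data/wiki_parquet` is an assumption from the exercise — substitute your own):

```scala
// Enable string interpretation of binary columns BEFORE loading the data.
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")

// Reload the Parquet file and re-register the table so the setting applies.
val wikiData = sqlContext.parquetFile("data/wiki_parquet")  // path is an assumption
wikiData.registerTempTable("wikiData")

// The same query should now print readable usernames instead of [B@... refs.
sqlContext.sql("SELECT username FROM wikiData LIMIT 10")
  .collect().foreach(println)
```

Note that the flag must be in effect when the table is loaded; flipping it after the fact, without reloading, may leave the already-registered table returning byte arrays.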