Looks like this data was encoded with an old version of Spark SQL. You'll need to set the flag to interpret binary data as a string. More info on configuration can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
sqlContext.sql("set spark.sql.parquet.binaryAsString=true")

Michael

On Fri, Oct 17, 2014 at 6:32 AM, neeraj <neeraj_gar...@infosys.com> wrote:
> Hi,
>
> When I run the given Spark SQL commands in the exercise, they return
> unexpected results. I'm explaining the results below for quick reference:
>
> 1. The output of the query wikiData.count() shows some count in the file.
>
> 2. After running the following query:
>
> sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE
> username <> '' GROUP BY username ORDER BY cnt DESC LIMIT
> 10").collect().foreach(println)
>
> I get output like the lines below (only the last couple of lines of this
> output are shown here). It doesn't show the actual results of the query.
> I tried increasing the driver memory as suggested in the exercise, but it
> doesn't help; the output is almost the same.
>
> 14/10/17 15:29:39 INFO executor.Executor: Finished task 199.0 in stage 2.0
> (TID 401). 2170 bytes result sent to driver
> 14/10/17 15:29:39 INFO executor.Executor: Finished task 198.0 in stage 2.0
> (TID 400). 2170 bytes result sent to driver
> 14/10/17 15:29:39 INFO scheduler.TaskSetManager: Finished task 198.0 in
> stage 2.0 (TID 400) in 13 ms on localhost (199/200)
> 14/10/17 15:29:39 INFO scheduler.TaskSetManager: Finished task 199.0 in
> stage 2.0 (TID 401) in 10 ms on localhost (200/200)
> 14/10/17 15:29:39 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0,
> whose tasks have all completed, from pool
> 14/10/17 15:29:39 INFO scheduler.DAGScheduler: Stage 2 (takeOrdered at
> basicOperators.scala:171) finished in 1.296 s
> 14/10/17 15:29:39 INFO spark.SparkContext: Job finished: takeOrdered at
> basicOperators.scala:171, took 3.150021634 s
>
> 3. I tried some other Spark SQL commands, as below:
>
> sqlContext.sql("SELECT username FROM wikiData LIMIT
> 10").collect().foreach(println)
>
> The output is:
> [[B@787cf559]
> [[B@53cfe3db]
> [[B@757869d9]
> [[B@346d61cf]
> [[B@793077ec]
> [[B@5d11651c]
> [[B@21054100]
> [[B@5fee77ef]
> [[B@21041d1d]
> [[B@15136bda]
>
> sqlContext.sql("SELECT * FROM wikiData LIMIT
> 10").collect().foreach(println)
>
> The output is:
> [12140913,[B@1d74e696,1394582048,[B@65ce90f5,[B@5c8ef90a]
> [12154508,[B@2e802eff,1393177457,[B@618d7f32,[B@1099dda7]
> [12165267,[B@65a70774,1398418319,[B@38da84cf,[B@12454f32]
> [12184073,[B@45264fd,1395243737,[B@3d642042,[B@7881ec8a]
> [12194348,[B@19d095d5,1372914018,[B@4d1ce030,[B@22c296dd]
> [12212394,[B@153e98ff,1389794332,[B@40ae983e,[B@68d2f9f]
> [12224899,[B@1f317315,1396830262,[B@677a77b2,[B@19487c31]
> [12240745,[B@65d181ee,1389890826,[B@1da9647b,[B@5c03d673]
> [12258034,[B@7ff44736,1385050943,[B@7e6f6bda,[B@4511f60f]
> [12279301,[B@1e317636,1382277991,[B@4147e2b6,[B@56753c35]
>
> I'm sure the above output of the queries is not the correct content of the
> parquet file. I'm not able to read the content of the parquet file
> directly.
>
> How do I validate the output of these queries against the actual content
> of the parquet file? What is the workaround for this issue? How do I read
> the file through Spark SQL? Is there a need to change the queries, and if
> so, what changes would give the exact result?
>
> Regards,
> Neeraj
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Help-required-on-exercise-Data-Exploratin-using-Spark-SQL-tp16569p16673.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
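For context on the strange values: the `[B@787cf559`-style output is not corrupted data. `[B@<hex>` is simply the default JVM toString of a `byte[]` (class name `[B` plus the identity hash code), which is what you get when a column is read as raw binary instead of a string. A minimal plain-Scala illustration (no Spark needed; the sample string is made up):

```scala
// A byte array printed directly uses Object.toString: "[B@" + hash code.
val bytes: Array[Byte] = "neeraj".getBytes("UTF-8")
println(bytes.toString)              // something like [B@5c8ef90a -- not the content
println(new String(bytes, "UTF-8"))  // decodes the same bytes back to "neeraj"
```

This is why the rows come back as `[B@...` references: the Parquet binary columns are surfacing as `byte[]` values, and setting `spark.sql.parquet.binaryAsString=true` tells Spark SQL to do the string decoding for you.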
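Putting Michael's fix together, a spark-shell sketch of the full sequence (Spark 1.x API as used in this thread; the Parquet path `data/wiki_parquet` is an assumption from the exercise — substitute your own):

```scala
// Enable string interpretation of binary columns BEFORE loading the data.
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")

// Reload the Parquet file and re-register the table so the setting applies.
val wikiData = sqlContext.parquetFile("data/wiki_parquet")  // path is an assumption
wikiData.registerTempTable("wikiData")

// The same query should now print readable usernames instead of [B@... refs.
sqlContext.sql("SELECT username FROM wikiData LIMIT 10")
  .collect().foreach(println)
```

Note that the flag must be in effect when the table is loaded; flipping it after the fact, without reloading, may leave the already-registered table returning byte arrays.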