Re: Fwd: Spark SQL: ArrayIndexOutofBoundsException
The drop() function is from Scala; it is a method on Array (and other Scala collections), not something Spark provides on RDDs.
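A minimal illustration of the distinction (the values here are assumptions, not from the thread):

    // drop(1) is a method on Scala collections such as Array...
    val arr  = Array("header", "row1", "row2")
    val body = arr.drop(1)               // Array("row1", "row2")
    // ...but org.apache.spark.rdd.RDD defines no drop() method:
    // sc.textFile(inp_file).drop(1)     // would not compile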
Fwd: Spark SQL: ArrayIndexOutofBoundsException
-- Forwarded message --
From: Liquan Pei <liquan...@gmail.com>
Date: Thu, Oct 2, 2014 at 3:42 PM
Subject: Re: Spark SQL: ArrayIndexOutofBoundsException
To: SK <skrishna...@gmail.com>

There is only one place where you use index 1. One possible issue is that you may have only one element after your split by "\t". Can you try running the following code to make sure every line has at least two elements?

    val tusers = sc.textFile(inp_file)
                   .map(_.split("\t"))
                   .filter(x => x.length < 2)
                   .count()

It should return a non-zero value if your data contains a line with fewer than two values.

Liquan

On Thu, Oct 2, 2014 at 3:35 PM, SK <skrishna...@gmail.com> wrote:

> Hi,
>
> I am trying to extract the number of distinct users from a file using
> Spark SQL, but I am getting the following error:
>
>     ERROR Executor: Exception in task 1.0 in stage 8.0 (TID 15)
>     java.lang.ArrayIndexOutOfBoundsException: 1
>
> I am following the code in examples/sql/RDDRelation.scala. My code is as
> follows; the error appears when the SQL statement executes. I am new to
> Spark SQL and would like to know how I can fix this issue. Thanks for
> your help.
>
>     val sql_cxt = new SQLContext(sc)
>     import sql_cxt._
>
>     // read the data using the schema and create a schema RDD
>     val tusers = sc.textFile(inp_file)
>                    .map(_.split("\t"))
>                    .map(p => TUser(p(0), p(1).trim.toInt))
>
>     // register the RDD as a table
>     tusers.registerTempTable("tusers")
>
>     // get the number of unique users
>     val unique_count = sql_cxt.sql("SELECT COUNT (DISTINCT userid) FROM tusers")
>                               .collect().head.getLong(0)
>     println(unique_count)

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
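For completeness, a minimal sketch of one possible fix (an assumption built from the code above, not a reply from the thread): filter out malformed rows before constructing TUser, so that p(1) is always a valid index.

    // Sketch: keep only lines that split into at least two fields,
    // so p(1) can never throw ArrayIndexOutOfBoundsException.
    val tusers = sc.textFile(inp_file)
                   .map(_.split("\t"))
                   .filter(p => p.length >= 2)
                   .map(p => TUser(p(0), p(1).trim.toInt))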
Re: Fwd: Spark SQL: ArrayIndexOutofBoundsException
Thanks for the help. Yes, I did not realize that the first header line has a different separator.

By the way, is there a way to drop the first line that contains the header? Something along the following lines:

    sc.textFile(inp_file)
      .drop(1)   // or tail(), to drop the header line
      .map(...)  // rest of the processing

I could not find a drop() function, or a way to take the bottom (n) elements of an RDD. Alternatively, a way to create the case class schema from the header line of the file and use the rest of the lines for the data would be useful - just a suggestion. Currently I am deleting the header line manually before processing the file in Spark.

thanks
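As an aside, one way to approximate drop(1) on an RDD (a sketch using RDD.zipWithIndex, an approach the replies below do not take):

    // Sketch: pair each line with its global index, then discard index 0.
    val noHeader = sc.textFile(inp_file)
                     .zipWithIndex()
                     .filter { case (_, idx) => idx > 0 }
                     .map { case (line, _) => line }

Note that zipWithIndex triggers an extra Spark job to compute per-partition offsets, so it is heavier than the mapPartitionsWithIndex approach suggested later in the thread.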
Re: Fwd: Spark SQL: ArrayIndexOutofBoundsException
You can do a filter with startsWith?
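A minimal sketch of that suggestion (assuming the header line starts with a known column name such as "userid"; the actual header text is not shown in the thread):

    // Sketch: drop the header by filtering out lines that start with
    // the (assumed) first column name.
    val data = sc.textFile(inp_file)
                 .filter(line => !line.startsWith("userid"))
                 .map(_.split("\t"))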
Re: Fwd: Spark SQL: ArrayIndexOutofBoundsException
This is hard to do in general, but you can get what you are asking for by putting the following class in scope:

    implicit class BetterRDD[A: scala.reflect.ClassTag](rdd: org.apache.spark.rdd.RDD[A]) {
      def dropOne = rdd.mapPartitionsWithIndex { (i, iter) =>
        if (i == 0 && iter.hasNext) { iter.next; iter } else iter
      }
    }

On Thu, Oct 2, 2014 at 4:06 PM, Sunny Khatri <sunny.k...@gmail.com> wrote:

> You can do a filter with startsWith?
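With that implicit class in scope, usage would look roughly like this (a sketch, not from the thread; note that dropOne removes only the first element of partition 0, which for a single input file read with sc.textFile is the header line):

    // Sketch: the implicit conversion makes dropOne available on any RDD.
    val tusers = sc.textFile(inp_file)
                   .dropOne
                   .map(_.split("\t"))
                   .map(p => TUser(p(0), p(1).trim.toInt))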