at last some progress :)

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 15 June 2016 at 10:52, Lee Ho Yeung <jobmatt...@gmail.com> wrote:

> Hi Mich,
>
> I found the cause of my problem now: I had missed setting the delimiter,
> which is a tab.
>
> But it still gets an error. I also notice that only LibreOffice opens and
> reads the file well; even Excel on Windows cannot separate it into the
> proper columns.
>
> scala> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").option("inferSchema", "true").option("delimiter",
> "").load("/home/martin/result002.csv")
> java.lang.StringIndexOutOfBoundsException: String index out of range: 0
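The StringIndexOutOfBoundsException above points at the empty delimiter
string: spark-csv presumably reads the first character of that option, and
"" has none. A minimal sketch of the tab-delimited read, assuming the same
spark-csv 1.4.0 package and file path used in this thread:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// "\t" in a Scala string literal is a real tab character; an empty
// delimiter is what triggers the StringIndexOutOfBoundsException.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .load("/home/martin/result002.csv")

df.printSchema()  // the tab-separated header should now split into a0 .. a9

Using the \t escape also avoids pasting a literal tab into the shell, where
tab completion can swallow it and leave the option empty.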
> On Wed, Jun 15, 2016 at 12:14 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> There may be an issue with the data in your csv file, like a blank
>> header line etc.
>>
>> Sounds like you have an issue there. I normally get rid of blank lines
>> before putting a csv file in HDFS.
>>
>> Can you actually select from that temp table? Like:
>>
>> sql("select TransactionDate, TransactionType, Description, Value,
>> Balance, AccountName, AccountNumber from tmp").take(2)
>>
>> Replace those with your column names. They are mapped using a case class.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> On 15 June 2016 at 03:02, Lee Ho Yeung <jobmatt...@gmail.com> wrote:
>>
>>> filter also has an error:
>>>
>>> 16/06/14 19:00:27 WARN Utils: Service 'SparkUI' could not bind on port
>>> 4040. Attempting port 4041.
>>> Spark context available as sc.
>>> SQL context available as sqlContext.
>>>
>>> scala> import org.apache.spark.sql.SQLContext
>>> import org.apache.spark.sql.SQLContext
>>>
>>> scala> val sqlContext = new SQLContext(sc)
>>> sqlContext: org.apache.spark.sql.SQLContext =
>>> org.apache.spark.sql.SQLContext@3114ea
>>>
>>> scala> val df =
>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>> "true").option("inferSchema", "true").load("/home/martin/result002.csv")
>>> 16/06/14 19:00:32 WARN SizeEstimator: Failed to check whether
>>> UseCompressedOops is set; assuming yes
>>> Java HotSpot(TM) Client VM warning: You have loaded library
>>> /tmp/libnetty-transport-native-epoll7823347435914767500.so which might
>>> have disabled stack guard. The VM will try to fix the stack guard now.
>>> It's highly recommended that you fix the library with 'execstack -c
>>> <libfile>', or link it with '-z noexecstack'.
>>> df: org.apache.spark.sql.DataFrame = [a0 a1 a2 a3 a4 a5 a6 a7 a8 a9: string]
>>>
>>> scala> df.printSchema()
>>> root
>>>  |-- a0 a1 a2 a3 a4 a5 a6 a7 a8 a9: string (nullable = true)
>>>
>>> scala> df.registerTempTable("sales")
>>>
>>> scala> df.filter($"a0".contains("found
>>> deep=1")).filter($"a1".contains("found
>>> deep=1")).filter($"a2".contains("found deep=1"))
>>> org.apache.spark.sql.AnalysisException: cannot resolve 'a0' given input
>>> columns: [a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 ];
>>> at
>>> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>>
>>> On Tue, Jun 14, 2016 at 6:19 PM, Lee Ho Yeung <jobmatt...@gmail.com> wrote:
>>>
>>>> after trying the following commands, it still cannot show the data:
>>>>
>>>> https://drive.google.com/file/d/0Bxs_ao6uuBDUVkJYVmNaUGx2ZUE/view?usp=sharing
>>>> https://drive.google.com/file/d/0Bxs_ao6uuBDUc3ltMVZqNlBUYVk/view?usp=sharing
>>>>
>>>> /home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages
>>>> com.databricks:spark-csv_2.11:1.4.0
>>>>
>>>> import org.apache.spark.sql.SQLContext
>>>>
>>>> val sqlContext = new SQLContext(sc)
>>>> val df =
>>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>>> "true").option("inferSchema", "true").load("/home/martin/result002.csv")
>>>> df.printSchema()
>>>> df.registerTempTable("sales")
>>>> val aggDF = sqlContext.sql("select * from sales where a0 like
>>>> \"%deep=3%\"")
>>>> df.collect.foreach(println)
>>>> aggDF.collect.foreach(println)
>>>>
>>>> val df =
>>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>>> "true").load("/home/martin/result002.csv")
>>>> df.printSchema()
>>>> df.registerTempTable("sales")
>>>> sqlContext.sql("select * from sales").take(30).foreach(println)
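The AnalysisException in the 03:02 message is the same delimiter problem
seen from the other side: read with the default comma delimiter, the whole
tab-separated header line becomes one column literally named
"a0 a1 a2 ... a9", so there is no column a0 to resolve. A sketch of the
filter and the like-query once the tab delimiter is set, under the same
assumptions as above (result002.csv with a header row of a0 .. a9):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // brings the $"colName" syntax into scope

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")   // splits the header into ten columns
  .load("/home/martin/result002.csv")

// With real a0 .. a9 columns, the chained filters resolve:
df.filter($"a0".contains("found deep=1"))
  .filter($"a1".contains("found deep=1"))
  .filter($"a2".contains("found deep=1"))
  .show()

// ... as does the SQL version, using single quotes for the literal:
df.registerTempTable("sales")
sqlContext.sql("select * from sales where a0 like '%deep=3%'").take(30).foreach(println)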