[ https://issues.apache.org/jira/browse/SPARK-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jai Murugesh Rajasekaran closed SPARK-12984.
--------------------------------------------

> Not able to read CSV file using Spark 1.4.0
> -------------------------------------------
>
>                 Key: SPARK-12984
>                 URL: https://issues.apache.org/jira/browse/SPARK-12984
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.4.0
>        Environment: Unix
> Hadoop 2.7.1.2.3.0.0-2557
> R 3.1.1
> No Internet access on the server
>           Reporter: Jai Murugesh Rajasekaran
>
> Hi,
> We are trying to read a CSV file.
> We downloaded the following CSV-related packages (jar files) and configured them using Maven:
> 1. spark-csv_2.10-1.2.0.jar
> 2. spark-csv_2.10-1.2.0-sources.jar
> 3. spark-csv_2.10-1.2.0-javadoc.jar
> We are trying to execute the following script:
> > library(SparkR)
> > sc <- sparkR.init(appName="SparkR-DataFrame")
> Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or restart R to create a new Spark Context
> > sqlContext <- sparkRSQL.init(sc)
> > setwd("/home/sXXXX/")
> > getwd()
> [1] "/home/sXXXX"
> > path <- file.path("Sample.csv")
> > Test <- read.df(sqlContext, path)
> Note: I am able to read the CSV file using regular R functions, but when I tried the SparkR functions I ended up with the error below.
> SparkR was initiated with:
> $ sh -x sparkR -v --repositories /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> Error messages/log:
> $ sh -x sparkR -v --repositories /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> +++ dirname sparkR
> ++ cd ./..
> ++ pwd
> + export SPARK_HOME=/opt/spark-1.4.0
> + SPARK_HOME=/opt/spark-1.4.0
> + source /opt/spark-1.4.0/bin/load-spark-env.sh
> ++++ dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ FWDIR=/opt/spark-1.4.0
> ++ '[' -z '' ']'
> ++ export SPARK_ENV_LOADED=1
> ++ SPARK_ENV_LOADED=1
> ++++ dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ parent_dir=/opt/spark-1.4.0
> ++ user_conf_dir=/opt/spark-1.4.0/conf
> ++ '[' -f /opt/spark-1.4.0/conf/spark-env.sh ']'
> ++ set -a
> ++ . /opt/spark-1.4.0/conf/spark-env.sh
> +++ export SPARK_HOME=/opt/spark-1.4.0
> +++ SPARK_HOME=/opt/spark-1.4.0
> +++ export YARN_CONF_DIR=/etc/hadoop/conf
> +++ YARN_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> ++ set +a
> ++ '[' -z '' ']'
> ++ ASSEMBLY_DIR2=/opt/spark-1.4.0/assembly/target/scala-2.11
> ++ ASSEMBLY_DIR1=/opt/spark-1.4.0/assembly/target/scala-2.10
> ++ [[ -d /opt/spark-1.4.0/assembly/target/scala-2.11 ]]
> ++ '[' -d /opt/spark-1.4.0/assembly/target/scala-2.11 ']'
> ++ export SPARK_SCALA_VERSION=2.10
> ++ SPARK_SCALA_VERSION=2.10
> + export -f usage
> + [[ -v --repositories /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar = *--help ]]
> + [[ -v --repositories /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar = *-h ]]
> + exec /opt/spark-1.4.0/bin/spark-submit sparkr-shell-main -v --repositories /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
>
> R version 3.1.1 (2014-07-10) -- "Sock it to Me"
> Copyright (C) 2014 The R Foundation for Statistical Computing
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>
> Natural language support but running in an English locale
>
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
>
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
>
> Revolution R Enterprise version 7.3: an enhanced distribution of R
> Revolution Analytics packages Copyright (C) 2014 Revolution Analytics, Inc.
>
> Type 'revo()' to visit www.revolutionanalytics.com for the latest
> Revolution R news, 'forum()' for the community forum, or 'readme()'
> for release notes.
> Launching java with spark-submit command /opt/spark-1.4.0/bin/spark-submit "--verbose" "--repositories" "/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar" "sparkr-shell" /tmp/RtmpO12CGx/backend_porteb570d7ca99
> Using properties file: /opt/spark-1.4.0/conf/spark-defaults.conf
> Adding default property: spark.yarn.am.extraJavaOptions=-Dhdp.version=2.2.0.0-2041
> Adding default property: spark.driver.extraJavaOptions=-Dhdp.version=2.2.0.0-2041
> Parsed arguments:
>   master                  local[*]
>   deployMode              null
>   executorMemory          null
>   executorCores           null
>   totalExecutorCores      null
>   propertiesFile          /opt/spark-1.4.0/conf/spark-defaults.conf
>   driverMemory            null
>   driverCores             null
>   driverExtraClassPath    null
>   driverExtraLibraryPath  null
>   driverExtraJavaOptions  -Dhdp.version=2.2.0.0-2041
>   supervise               false
>   queue                   null
>   numExecutors            null
>   files                   null
>   pyFiles                 null
>   archives                null
>   mainClass               null
>   primaryResource         sparkr-shell
>   name                    sparkr-shell
>   childArgs               [/tmp/RtmpO12CGx/backend_porteb570d7ca99]
>   jars                    null
>   packages                null
>   repositories            /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
>   verbose                 true
> Spark properties used, including those specified through --conf and those from the properties file /opt/spark-1.4.0/conf/spark-defaults.conf:
>   spark.driver.extraJavaOptions -> -Dhdp.version=2.2.0.0-2041
>   spark.yarn.am.extraJavaOptions -> -Dhdp.version=2.2.0.0-2041
> Main class:
> org.apache.spark.api.r.RBackend
> Arguments:
> /tmp/RtmpO12CGx/backend_porteb570d7ca99
> System properties:
>   SPARK_SUBMIT -> true
>   spark.app.name -> sparkr-shell
>   spark.driver.extraJavaOptions -> -Dhdp.version=2.2.0.0-2041
>   spark.yarn.am.extraJavaOptions -> -Dhdp.version=2.2.0.0-2041
>   spark.master -> local[*]
> Classpath elements:
>
> 16/01/21 10:44:34 INFO spark.SparkContext: Running Spark version 1.4.0
> 16/01/21 10:44:35 INFO spark.SecurityManager: Changing view acls to: sXXXX
> 16/01/21 10:44:35 INFO spark.SecurityManager: Changing modify acls to: sXXXX
> 16/01/21 10:44:35 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sXXXX); users with modify permissions: Set(sXXXX)
> 16/01/21 10:44:36 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 16/01/21 10:44:36 INFO Remoting: Starting remoting
> 16/01/21 10:44:36 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@99.99.99.99:99999]
> 16/01/21 10:44:36 INFO util.Utils: Successfully started service 'sparkDriver' on port 99999.
> 16/01/21 10:44:36 INFO spark.SparkEnv: Registering MapOutputTracker
> 16/01/21 10:44:36 INFO spark.SparkEnv: Registering BlockManagerMaster
> 16/01/21 10:44:36 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-522b123c-d80d-4b88-98a7-a251b071704e/blockmgr-8e7084f2-4b1b-465e-8ac1-5b4b3dcf44e5
> 16/01/21 10:44:36 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 MB
> 16/01/21 10:44:37 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-522b123c-d80d-4b88-98a7-a251b071704e/httpd-61e30295-e750-4682-9420-37d5162b89c7
> 16/01/21 10:44:37 INFO spark.HttpServer: Starting HTTP Server
> 16/01/21 10:44:37 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/01/21 10:44:37 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:36797
> 16/01/21 10:44:37 INFO util.Utils: Successfully started service 'HTTP file server' on port 36797.
> 16/01/21 10:44:37 INFO spark.SparkEnv: Registering OutputCommitCoordinator
> 16/01/21 10:44:37 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/01/21 10:44:37 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
> 16/01/21 10:44:37 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
> 16/01/21 10:44:37 INFO ui.SparkUI: Started SparkUI at http://99.99.99.99:4040
> 16/01/21 10:44:37 INFO executor.Executor: Starting executor ID driver on host localhost
> 16/01/21 10:44:37 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36799.
> 16/01/21 10:44:37 INFO netty.NettyBlockTransferService: Server created on 36799
> 16/01/21 10:44:37 INFO storage.BlockManagerMaster: Trying to register BlockManager
> 16/01/21 10:44:37 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:36799 with 265.4 MB RAM, BlockManagerId(driver, localhost, 36799)
> 16/01/21 10:44:37 INFO storage.BlockManagerMaster: Registered BlockManager
>
> Welcome to SparkR!
> Spark context is available as sc, SQL context is available as sqlContext
> During startup - Warning message:
> package ‘SparkR’ was built under R version 3.1.3
> > library(SparkR)
> > sc <- sparkR.init(appName="SparkR-DataFrame")
> Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or restart R to create a new Spark Context
> > sqlContext <- sparkRSQL.init(sc)
> > setwd("/home/sXXXX/")
> > getwd()
> [1] "/home/sXXXX"
> > path <- file.path("Sample.csv")
> > Test <- read.df(sqlContext, path)
> 16/01/21 10:46:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 16/01/21 10:46:14 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
> 16/01/21 10:46:14 ERROR r.RBackendHandler: load on 1 failed
> java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
>         at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
>         at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
>         at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>         at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>         at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet data file or summary file found under .
>         at scala.Predef$.assert(Predef.scala:179)
>         at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:443)
>         at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$15.apply(newParquet.scala:385)
>         at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$15.apply(newParquet.scala:385)
>         at scala.Option.orElse(Option.scala:257)
>         at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:385)
>         at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
>         at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
>         at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:193)
>         at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:193)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
>         at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:505)
>         at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:504)
>         at org.apache.spark.sql.sources.LogicalRelation.<init>(LogicalRelation.scala:30)
>         at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120)
>         at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1230)
>         ... 25 more
> Error: returnStatus == 0 is not TRUE
> > path <- read.df(sqlContext, "/home/sXXXX/Sample.csv", source = "com.databricks.spark.csv")
> 16/01/21 10:46:48 ERROR r.RBackendHandler: load on 1 failed
> java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
>         at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
>         at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
>         at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>         at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>         at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
>         at scala.sys.package$.error(package.scala:27)
>         at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:216)
>         at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:229)
>         at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
>         at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1230)
>         ... 25 more
> Error: returnStatus == 0 is not TRUE
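
Two details in the quoted log point at the likely causes. First, read.df is called without a source argument, so Spark 1.4 falls back to its default data source, Parquet, which is what produces the "No schema defined, and no Parquet data file or summary file found" assertion. Second, --repositories expects Maven repository URLs rather than jar file paths (the verbose output shows both jars and packages as null), so the spark-csv classes never reach the driver classpath, which is what produces "Failed to load class for data source: com.databricks.spark.csv". A minimal sketch of the usual setup follows; the jar path is the one from the report, while the --packages coordinate and the read options are illustrative assumptions, not something verified on this cluster:

  # Option 1 (no network needed): put the already-downloaded binary jar on the
  # classpath directly. Only the main jar is useful at runtime (not the
  # -sources/-javadoc jars); spark-csv's own dependencies, such as commons-csv,
  # may need to be listed the same way.
  sparkR --jars /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar

  # Option 2 (needs a reachable Maven repository): let spark-submit resolve the
  # package and its dependencies by coordinate.
  sparkR --packages com.databricks:spark-csv_2.10:1.2.0

  # Then name the data source explicitly when reading:
  Test <- read.df(sqlContext, "/home/sXXXX/Sample.csv",
                  source = "com.databricks.spark.csv", header = "true")

With either launch option, naming the source in read.df should route the load through the CSV reader instead of the Parquet default.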