This looks more like a matter for Databricks support than spark-user.

On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
> df = spark.sqlContext.read.csv('out/df_in.csv')
>
> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so
> recording the schema version 1.2.0
> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp,
> returning NoSuchObjectException
>
> Py4JJavaError: An error occurred while calling o72.csv.
> : java.lang.RuntimeException: Multiple sources found for csv
> (com.databricks.spark.csv.DefaultSource15,
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please
> specify the fully qualified class name.
>     at scala.sys.package$.error(package.scala:27)
>     at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:591)
>     at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
>     at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
>     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>     at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>     at py4j.Gateway.invoke(Gateway.java:280)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:214)
>     at java.lang.Thread.run(Thread.java:745)
>
> When I change our call to:
>
> df = spark.hiveContext.read \
>     .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
>     .load('df_in.csv')
>
> No such issue. I was under the impression (obviously wrongly) that Spark
> would automatically pick the local lib. We have the Databricks library
> because other jobs still explicitly call it.
>
> Is the 'correct answer' to go through and modify so as to remove the
> Databricks lib / remove it from our deploy? Or should this just work?
>
> One of the things I find less helpful in the Spark docs is when there are
> multiple ways to do something but no clear guidance on what those methods
> are intended to accomplish.
>
> Thanks!
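For anyone hitting the same stack trace: the exception comes from Spark's data-source lookup, which resolves a short name like "csv" against all registered providers and gives up when more than one claims it (here, the bundled spark-csv package and Spark 2.x's built-in CSV reader), unless you pass the fully qualified class name. Below is a simplified, pure-Python sketch of that resolution logic; the registry contents and function name are illustrative, not Spark's actual code.

```python
# Simplified model of Spark's data-source name resolution.
# Illustrative registry: provider class name -> short names it claims.
# With both spark-csv and Spark 2.x's built-in reader on the classpath,
# two providers claim the short name "csv".
REGISTERED_PROVIDERS = {
    "com.databricks.spark.csv.DefaultSource15": ["csv"],
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat": ["csv"],
}

def lookup_data_source(name):
    # A fully qualified class name matches exactly one provider.
    if name in REGISTERED_PROVIDERS:
        return name
    # A short name may match several providers; that is the error case
    # from the thread above.
    matches = [cls for cls, shorts in REGISTERED_PROVIDERS.items()
               if name in shorts]
    if len(matches) > 1:
        raise RuntimeError(
            "Multiple sources found for %s (%s), please specify the "
            "fully qualified class name." % (name, ", ".join(matches)))
    if not matches:
        raise ValueError("Failed to find data source: %s" % name)
    return matches[0]
```

This is why `read.csv(...)` fails while `.format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')` works: the fully qualified name bypasses the ambiguous short-name lookup entirely.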