Re: Multiple CSV libs cause issues (Spark 2.1)

2017-05-09 Thread lucas.g...@gmail.com
> df = spark.sqlContext.read.csv('out/df_in.csv')
>
> shouldn't this be just -
>
> df = spark.read.csv('out/df_in.csv')
>
> SparkSession itself is the entry point to DataFrame and SQL functionality.



Our bootstrap is a bit messy, so in our case, no. In the general case, yes.

On 9 May 2017 at 16:56, Pushkar.Gujar  wrote:

> [snip]


Re: Multiple CSV libs cause issues (Spark 2.1)

2017-05-09 Thread Pushkar.Gujar
>
> df = spark.sqlContext.read.csv('out/df_in.csv')
>

shouldn't this be just -

df = spark.read.csv('out/df_in.csv')

SparkSession itself is the entry point to DataFrame and SQL functionality.
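
For example, a minimal sketch (assuming a plain PySpark 2.x session where
only the built-in CSV source is on the classpath; the app name and reader
options are just for illustration):

from pyspark.sql import SparkSession

# SparkSession is the single entry point for DataFrame and SQL work in
# Spark 2.x; getOrCreate() reuses an existing session when there is one.
spark = SparkSession.builder.appName('csv-example').getOrCreate()

# spark.read returns a DataFrameReader; csv() resolves the built-in source.
df = spark.read.csv('out/df_in.csv', header=True, inferSchema=True)
df.show(5)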


Thank you,
Pushkar Gujar


On Tue, May 9, 2017 at 6:09 PM, Mark Hamstra 
wrote:

> Looks to me like it is a conflict between a Databricks library and Spark
> 2.1. That's an issue for Databricks to resolve or provide guidance.
>
> [snip]


Re: Multiple CSV libs cause issues (Spark 2.1)

2017-05-09 Thread Hyukjin Kwon
Sounds like it is related to https://github.com/apache/spark/pull/17916

We will allow Spark to pick up the internal one if that PR gets merged.
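
Until then, specifying the fully qualified class name, as the error message
suggests, should disambiguate. A rough sketch (reusing the path from the
report, and assuming an existing session named spark):

# Select the built-in Spark 2.1 CSV source by its fully qualified class
# name so the datasource lookup does not see two matches for 'csv'.
df = (spark.read
      .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
      .load('out/df_in.csv'))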

On 10 May 2017 7:09 am, "Mark Hamstra"  wrote:

> Looks to me like it is a conflict between a Databricks library and Spark
> 2.1. That's an issue for Databricks to resolve or provide guidance.
>
> [snip]


Re: Multiple CSV libs cause issues (Spark 2.1)

2017-05-09 Thread Mark Hamstra
Looks to me like it is a conflict between a Databricks library and Spark
2.1. That's an issue for Databricks to resolve or provide guidance.

On Tue, May 9, 2017 at 2:36 PM, lucas.g...@gmail.com 
wrote:

> I'm a bit confused by that answer; I'm assuming it's Spark that decides
> which lib to use.
>
> [snip]


Re: Multiple CSV libs cause issues (Spark 2.1)

2017-05-09 Thread lucas.g...@gmail.com
I'm a bit confused by that answer; I'm assuming it's Spark that decides
which lib to use.

On 9 May 2017 at 14:30, Mark Hamstra  wrote:

> This looks more like a matter for Databricks support than spark-user.
>
> [snip]


Re: Multiple CSV libs cause issues (Spark 2.1)

2017-05-09 Thread Mark Hamstra
This looks more like a matter for Databricks support than spark-user.

On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com 
wrote:

> df = spark.sqlContext.read.csv('out/df_in.csv')
>
> [snip]


Multiple CSV libs cause issues (Spark 2.1)

2017-05-09 Thread lucas.g...@gmail.com
>
> df = spark.sqlContext.read.csv('out/df_in.csv')
>


> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp,
> returning NoSuchObjectException
>


> Py4JJavaError: An error occurred while calling o72.csv.
> : java.lang.RuntimeException: Multiple sources found for csv
> (com.databricks.spark.csv.DefaultSource15,
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please
> specify the fully qualified class name.
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:591)
> at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
> at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:745)


When I change our call to:

df = spark.hiveContext.read \
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
    .load('df_in.csv')

No such issue. I was under the impression (obviously wrongly) that Spark
would automatically pick the local lib. We have the Databricks library
because other jobs still explicitly call it.
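
(For reference, those other jobs name the Databricks source explicitly,
roughly like the following sketch; the path and options here are made up:)

# Hypothetical sketch of a legacy job that selects the spark-csv package
# by its source name, which is why the Databricks jar is still deployed.
df_legacy = (spark.read
             .format('com.databricks.spark.csv')
             .option('header', 'true')
             .load('out/legacy.csv'))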

Is the 'correct answer' to go through and modify those jobs so we can
remove the Databricks lib from our deploy?  Or should this just work?

One of the things I find less helpful in the Spark docs is that when there
are multiple ways to do something, there is no clear guidance on what those
methods are intended to accomplish.

Thanks!