Sounds like it is related to https://github.com/apache/spark/pull/17916

We will allow picking up the internal one if that PR gets merged.
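
For reference, once the built-in source is preferred, a plain read should resolve without naming any class. A minimal PySpark sketch, assuming the external spark-csv package can stay on the classpath:

df = spark.read.csv('out/df_in.csv')

Until then, specifying the fully qualified class name, as the error message suggests, is the workaround.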

On 10 May 2017 7:09 am, "Mark Hamstra" <m...@clearstorydata.com> wrote:

> Looks to me like it is a conflict between a Databricks library and Spark
> 2.1. That's an issue for Databricks to resolve or provide guidance on.
>
> On Tue, May 9, 2017 at 2:36 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
>
>> I'm a bit confused by that answer; I'm assuming it's Spark deciding which
>> lib to use.
>>
>> On 9 May 2017 at 14:30, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>> This looks more like a matter for Databricks support than spark-user.
>>>
>>> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
>>>
>>>> df = spark.sqlContext.read.csv('out/df_in.csv')
>>>>>
>>>>
>>>>
>>>>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
>>>>> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
>>>>> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
>>>>>
>>>>
>>>>
>>>>> Py4JJavaError: An error occurred while calling o72.csv.
>>>>> : java.lang.RuntimeException: Multiple sources found for csv (com.databricks.spark.csv.DefaultSource15, org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please specify the fully qualified class name.
>>>>> at scala.sys.package$.error(package.scala:27)
>>>>> at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:591)
>>>>> at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
>>>>> at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
>>>>> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
>>>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>>>>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>>>> at py4j.Gateway.invoke(Gateway.java:280)
>>>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>>>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>>> at py4j.GatewayConnection.run(GatewayConnection.java:214)
>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>
>>>>
>>>> When I change our call to:
>>>>
>>>> df = spark.hiveContext.read \
>>>>     .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
>>>>     .load('df_in.csv')
>>>>
>>>> No such issue. I was under the impression (obviously wrongly) that Spark
>>>> would automatically pick the local lib. We have the Databricks library
>>>> because other jobs still explicitly call it.
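>>>>
>>>> For the jobs that still need the external source, pinning it explicitly should avoid the ambiguity. A rough sketch, assuming they read CSVs the same way as above:
>>>>
>>>> df_legacy = spark.hiveContext.read \
>>>>     .format('com.databricks.spark.csv') \
>>>>     .load('df_in.csv')
>>>>
>>>> The bare name 'csv' is what becomes ambiguous once both sources are on the classpath.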
>>>>
>>>> Is the 'correct answer' to go through and modify those jobs so we can remove
>>>> the Databricks lib from our deploy?  Or should this just work?
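>>>>
>>>> If the external package comes in via --packages, which is an assumption about our deploy, removing it would just mean dropping that coordinate from spark-submit, e.g. (my_job.py is a placeholder):
>>>>
>>>> spark-submit --packages com.databricks:spark-csv_2.11:1.5.0 my_job.py   # before
>>>> spark-submit my_job.py                                                  # after: only the built-in CSV source is on the classpath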
>>>>
>>>> One of the things I find less helpful in the Spark docs is when there are
>>>> multiple ways to do something but no clear guidance on what each method is
>>>> intended to accomplish.
>>>>
>>>> Thanks!
>>>>
>>>
>>>
>>
>
