>> df = spark.sqlContext.read.csv('out/df_in.csv')
>
> shouldn't this be just -
>
>     df = spark.read.csv('out/df_in.csv')
>
> SparkSession itself is the entry point to DataFrames and SQL functionality.



Our bootstrap is a bit messy, so in our case, no. In the general case, yes.
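
For the general case, here is a minimal sketch of what that looks like (assuming a stock Spark 2.x setup where spark is the SparkSession; the path and options are illustrative):

    from pyspark.sql import SparkSession

    # In Spark 2.x the SparkSession is the single entry point for DataFrames and SQL.
    spark = SparkSession.builder.appName('csv-read-example').getOrCreate()

    # Built-in CSV reader hanging off the session; no need to go through sqlContext.
    df = spark.read.csv('out/df_in.csv', header=True, inferSchema=True)

There is also a sketch at the very bottom of the thread for pinning the CSV source explicitly, given the "Multiple sources found" error quoted below.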

On 9 May 2017 at 16:56, Pushkar.Gujar <pushkarvgu...@gmail.com> wrote:

>> df = spark.sqlContext.read.csv('out/df_in.csv')
>
> shouldn't this be just -
>
>     df = spark.read.csv('out/df_in.csv')
>
> SparkSession itself is the entry point to DataFrames and SQL functionality.
>
>
> Thank you,
> *Pushkar Gujar*
>
>
> On Tue, May 9, 2017 at 6:09 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> Looks to me like it is a conflict between a Databricks library and Spark
>> 2.1. That's an issue for Databricks to resolve or provide guidance on.
>>
>> On Tue, May 9, 2017 at 2:36 PM, lucas.g...@gmail.com <
>> lucas.g...@gmail.com> wrote:
>>
>>> I'm a bit confused by that answer; I'm assuming it's Spark deciding
>>> which lib to use.
>>>
>>> On 9 May 2017 at 14:30, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>
>>>> This looks more like a matter for Databricks support than spark-user.
>>>>
>>>> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <
>>>> lucas.g...@gmail.com> wrote:
>>>>
>>>>> df = spark.sqlContext.read.csv('out/df_in.csv')
>>>>>>
>>>>>
>>>>>
>>>>>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
>>>>>> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
>>>>>> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
>>>>>>
>>>>>
>>>>>
>>>>>> Py4JJavaError: An error occurred while calling o72.csv.
>>>>>> : java.lang.RuntimeException: Multiple sources found for csv (*com.databricks.spark.csv.DefaultSource15, org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*), please specify the fully qualified class name.
>>>>>> at scala.sys.package$.error(package.scala:27)
>>>>>> at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:591)
>>>>>> at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
>>>>>> at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
>>>>>> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
>>>>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>>>>>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>>>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>>>>> at py4j.Gateway.invoke(Gateway.java:280)
>>>>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>>>>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>>>> at py4j.GatewayConnection.run(GatewayConnection.java:214)
>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>>
>>>>> When I change our call to:
>>>>>
>>>>> df = spark.hiveContext.read \
>>>>>     .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
>>>>>     .load('df_in.csv')
>>>>>
>>>>> No such issue. I was under the impression (obviously wrongly) that
>>>>> Spark would automatically pick the built-in lib. We have the Databricks
>>>>> library because other jobs still explicitly call it.
>>>>>
>>>>> Is the 'correct answer' to go through our jobs and remove the
>>>>> Databricks lib from the code and from our deploy? Or should this just work?
>>>>>
>>>>> One of the things I find less helpful in the Spark docs is that when
>>>>> there are multiple ways to do something, there is no clear guidance on
>>>>> what those methods are intended to accomplish.
>>>>>
>>>>> Thanks!
>>>>>
>>>>
>>>>
>>>
>>
>
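
P.S. For the archives, a sketch of the disambiguation the error message asks for (class names are taken from the error above; path and option are illustrative, not our actual bootstrap):

    # Force the built-in Spark 2.x CSV source by its fully qualified class name,
    # so the ambiguous short name 'csv' is never looked up.
    df = spark.read \
        .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
        .load('out/df_in.csv')

    # Jobs that still want the Databricks package can keep addressing it by its own
    # qualified name instead of the short name 'csv'.
    df_db = spark.read \
        .format('com.databricks.spark.csv') \
        .option('header', 'true') \
        .load('out/df_in.csv')

Alternatively, dropping spark-csv from the deploy removes the ambiguity entirely for Spark 2.x jobs.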
