[GitHub] spark pull request #16995: [SPARK-19340][SQL] CSV file will result in an exc...

lxsmnv Sun, 19 Feb 2017 18:26:27 -0800

GitHub user lxsmnv opened a pull request:

    https://github.com/apache/spark/pull/16995


    [SPARK-19340][SQL] CSV file will result in an exception if the filename 
contains special characters

    ## What changes were proposed in this pull request?
    The root cause of the problem is that when Spark infers the schema from 
a CSV file, it resolves the file path pattern more than once, calling 
DataSource.resolveRelation each time.
    
    So, if we have a file path pattern like:
    <...>/test*
    and an actual file named test{00-1}.txt,
    then the initial call to DataSource.resolveRelation resolves the pattern to 
/<...>/test{00-1}.txt. When Spark then infers the schema for the CSV file, 
it calls DataSource.resolveRelation a second time. This second attempt to 
resolve the path fails because the file name /<...>/test{00-1}.txt is 
treated as a pattern rather than as an actual file, and if no file matches 
that pattern, the whole DataSource.resolveRelation call fails.
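
    The double-resolution problem described above can be reproduced in isolation. The following is a minimal sketch, not Spark code: it uses the JDK's java.nio glob matcher as a stand-in for Hadoop's glob expansion, on the assumption that both treat braces as pattern syntax. The class and method names (DoubleGlobDemo, matches) are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class DoubleGlobDemo {
    // Returns true if fileName matches the given glob pattern.
    // java.nio glob syntax is used here as a stand-in for Hadoop's globbing.
    static boolean matches(String pattern, String fileName) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        String actualFile = "test{00-1}.txt";

        // First resolution: the user's pattern matches the real file name.
        System.out.println(matches("test*", actualFile));      // true

        // Second resolution: the resolved name is treated as a pattern again.
        // The braces are read as a group, so it no longer matches itself.
        System.out.println(matches(actualFile, actualFile));   // false

        // What the resolved name matches when read as a pattern.
        System.out.println(matches(actualFile, "test00-1.txt")); // true
    }
}
```

    The second check is the failure mode from the description: a name that was produced by resolving a pattern is not safe to resolve as a pattern again.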
    
    The idea behind the fix is quite straightforward:
    The part of DataSource.resolveRelation that creates the Hadoop relation 
from the resolved (actual) file names is moved into a separate function, 
createHadoopRelation. CSVFileFormat.createBaseDataset calls this new function 
instead of DataSource.resolveRelation, which caused the unnecessary second 
file path resolution.
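
    The shape of this refactoring can be sketched outside Spark. The code below is only an illustration under stated assumptions: resolveGlob, createRelation, and resolveRelation are hypothetical stand-ins (they are not Spark's actual signatures), a list of strings stands in for the relation, and java.nio glob matching stands in for Hadoop's.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ResolveOnceSketch {
    // Step 1: expand a glob pattern against a directory listing.
    static List<String> resolveGlob(String pattern, List<String> listing) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        List<String> resolved = new ArrayList<>();
        for (String name : listing) {
            if (matcher.matches(Paths.get(name))) {
                resolved.add(name);
            }
        }
        if (resolved.isEmpty()) {
            throw new IllegalArgumentException("Path does not exist: " + pattern);
        }
        return resolved;
    }

    // Step 2 (stand-in for the extracted function): build the relation from
    // names that are already resolved, treating them as literal file names.
    static List<String> createRelation(List<String> resolvedNames) {
        return new ArrayList<>(resolvedNames);
    }

    // Stand-in for the combined entry point: resolve, then create.
    static List<String> resolveRelation(String pattern, List<String> listing) {
        return createRelation(resolveGlob(pattern, listing));
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList("test{00-1}.txt", "other.csv");
        List<String> resolved = resolveGlob("test*", listing);

        // After the change: schema inference reuses the resolved names.
        System.out.println(createRelation(resolved));

        // Before the change: the resolved name is resolved again and fails.
        try {
            resolveRelation(resolved.get(0), listing);
        } catch (IllegalArgumentException e) {
            System.out.println("second resolution failed: " + e.getMessage());
        }
    }
}
```

    Keeping glob expansion and relation creation as separate steps means a caller that already holds resolved names can skip expansion entirely.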
    
    ## How was this patch tested?
    manual tests
    
    This contribution is my original work and I license the work to the project 
under the project's open source license.
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lxsmnv/spark SPARK-19340

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16995.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16995
    
----
commit 507a929694653d49d1eb42398131743e0d004f65
Author: lxsmnv <alexse...@gmail.com>
Date:   2017-02-20T01:52:40Z

    SPARK-19340 file path resolution for csv files fixed

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
