[ 
https://issues.apache.org/jira/browse/SPARK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312622#comment-14312622
 ] 

Josh Rosen commented on SPARK-5685:
-----------------------------------

[~nchammas], in general I'm a huge fan of runtime warnings / better exceptions, 
especially for common issues like this.  I wonder whether it would be too noisy 
to log a warning every time textFile is used on compressed input.  Instead, what 
do you think about logging only when minPartitions > 1 and the input is 
compressed?  This would cover the case where a user sees that they're not 
getting enough parallelism and then tries to increase it.
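
As a rough sketch of that condition (not Spark's actual code): the helper name 
should_warn and the extension set below are hypothetical, and detecting 
compression by file extension alone is an assumption made for illustration.

```python
import os

# Assumption: only ".gz" is listed here; a real check would need to cover
# whichever non-splittable codecs are actually configured.
UNSPLITTABLE_EXTENSIONS = {".gz"}


def should_warn(path, min_partitions):
    """Return True when the path looks gzip-compressed AND the caller asked
    for more than one partition, i.e. they expected parallelism."""
    _, ext = os.path.splitext(path.lower())
    return ext in UNSPLITTABLE_EXTENSIONS and min_partitions > 1
```

With this rule, a plain textFile call on a gzipped file stays silent, and the 
warning only appears once the user explicitly requests more partitions.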

Also, what happens if the user specifies the path of a directory that contains 
many input files, some of which are compressed and some of which aren't?  Does 
the driver know whether the files are compressed in an unsplittable way, or do 
we only discover this on the executors once the job runs?
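
To make the mixed-directory case concrete, here is a minimal sketch of how a 
listing of file names could be split by extension before any job runs. It does 
not answer whether Spark's driver has this information; the function name 
split_by_compression and the default suffix tuple are hypothetical.

```python
def split_by_compression(file_names, unsplittable_suffixes=(".gz",)):
    """Partition file names into (unsplittable, other) by suffix only.

    Assumption: the suffix reliably reflects the codec, which need not hold.
    """
    suffixes = tuple(unsplittable_suffixes)
    unsplittable, other = [], []
    for name in file_names:
        if name.lower().endswith(suffixes):
            unsplittable.append(name)
        else:
            other.append(name)
    return unsplittable, other
```

A warning could then mention only the files in the first list, rather than 
firing for the whole directory.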

> Show warning when users open text files compressed with non-splittable 
> algorithms like gzip
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-5685
>                 URL: https://issues.apache.org/jira/browse/SPARK-5685
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> This is a usability or user-friendliness issue.
> It's extremely common for people to load a text file compressed with gzip, 
> process it, and then wonder why only 1 core in their cluster is doing any 
> work.
> Some examples:
> * http://stackoverflow.com/q/28127119/877069
> * http://stackoverflow.com/q/27531816/877069
> I'm not sure how this problem can be generalized, but at the very least it 
> would be helpful if Spark displayed some kind of warning in the common case 
> when someone opens a gzipped file with {{sc.textFile}}.



