[GitHub] [spark] HyukjinKwon commented on a change in pull request #26958: [SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in

2019-12-20 Thread GitBox
HyukjinKwon commented on a change in pull request #26958: 
[SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 
'pathGlobFilter' in file sources 'mergeSchema' in ORC
URL: https://github.com/apache/spark/pull/26958#discussion_r360619514
 
 

 ##
 File path: python/pyspark/sql/readwriter.py
 ##
 @@ -520,20 +537,24 @@ def func(iterator):
 raise TypeError("path can be only string, list or RDD")
 
 @since(1.5)
-def orc(self, path, mergeSchema=None, recursiveFileLookup=None):
+def orc(self, path, mergeSchema=None, pathGlobFilter=None, recursiveFileLookup=None):
 """Loads ORC files, returning the result as a :class:`DataFrame`.
 
 :param mergeSchema: sets whether we should merge schemas collected from all
 ORC part-files. This will override ``spark.sql.orc.mergeSchema``.
 The default value is specified in ``spark.sql.orc.mergeSchema``.
+:param pathGlobFilter: an optional glob pattern to only include files with paths matching
+   the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
+   It does not change the behavior of `partition discovery`_.
 :param recursiveFileLookup: recursively scan a directory for files. Using this option
-disables `partition discovery`_.
+disables `partition discovery`_.
 
 Review comment:
   So, if you don't mind, I would like to run that separately :-)
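
For readers unfamiliar with the two options documented in the hunk above, here is a minimal sketch of the idea in plain Python. It is not Spark code: `list_files` and `demo` are hypothetical helpers, and `fnmatch` is only a stand-in for `org.apache.hadoop.fs.GlobFilter`, whose syntax is not identical. The sketch assumes the glob is applied to the file name and that recursion simply walks subdirectories.

```python
import fnmatch
import os
import tempfile


def list_files(root, path_glob_filter=None, recursive=False):
    """Return sorted relative paths of files under root.

    Hypothetical helper, not Spark code: `path_glob_filter` is matched
    against the file name with fnmatch, and `recursive` decides whether
    subdirectories are scanned.
    """
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        if not recursive:
            dirnames[:] = []  # prune the walk: stay in the top-level directory
        for name in filenames:
            if path_glob_filter is None or fnmatch.fnmatch(name, path_glob_filter):
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                matches.append(rel.replace(os.sep, "/"))
    return sorted(matches)


def demo():
    """Build a tiny directory tree and list it with and without recursion."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "nested"))
        for rel in ("a.orc", "b.json", "nested/c.orc"):
            with open(os.path.join(root, *rel.split("/")), "w") as fh:
                fh.write("")
        return (
            list_files(root, "*.orc"),
            list_files(root, "*.orc", recursive=True),
        )


if __name__ == "__main__":
    print(demo())
```

Going by the signature in the hunk, the corresponding PySpark call would presumably look like `spark.read.orc(path, pathGlobFilter="*.orc", recursiveFileLookup=True)`.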


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #26958: [SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in

2019-12-20 Thread GitBox
HyukjinKwon commented on a change in pull request #26958: 
[SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 
'pathGlobFilter' in file sources 'mergeSchema' in ORC
URL: https://github.com/apache/spark/pull/26958#discussion_r360619476
 
 

 ##
 File path: python/pyspark/sql/readwriter.py
 ##
 @@ -520,20 +537,24 @@ def func(iterator):
 raise TypeError("path can be only string, list or RDD")
 
 @since(1.5)
-def orc(self, path, mergeSchema=None, recursiveFileLookup=None):
+def orc(self, path, mergeSchema=None, pathGlobFilter=None, recursiveFileLookup=None):
 """Loads ORC files, returning the result as a :class:`DataFrame`.
 
 :param mergeSchema: sets whether we should merge schemas collected from all
 ORC part-files. This will override ``spark.sql.orc.mergeSchema``.
 The default value is specified in ``spark.sql.orc.mergeSchema``.
+:param pathGlobFilter: an optional glob pattern to only include files with paths matching
+   the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
+   It does not change the behavior of `partition discovery`_.
 :param recursiveFileLookup: recursively scan a directory for files. Using this option
-disables `partition discovery`_.
+disables `partition discovery`_.
 
 Review comment:
   Since we're going ahead with Spark 3, we won't likely backport many things 
that cause conflicts. So I was thinking it's a feasible option.
   
   Also, I think we might have to document first that vertical alignment 
isn't preferred. I think vertical alignment is still valid per PEP 8 and PEP 
257.
   
   One downside of doing it bit by bit is the confusion caused by mixed style. 
Considering that we won't likely add many new docstrings, the mixed style would 
exist for a long time.
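
To make the two docstring styles under discussion concrete, here is a small sketch. The functions `read_aligned` and `read_hanging` are hypothetical and exist only for illustration. The first aligns continuation lines vertically under the parameter description, and the second uses a fixed hanging indent. The helper `param_text` shows that, once whitespace is collapsed, both carry the same text, so the difference is purely one of layout.

```python
import inspect


def read_aligned(path, recursive=None):
    """Load files (hypothetical function, vertical alignment).

    :param recursive: recursively scan a directory for files. Using this
                      option disables partition discovery.
    """


def read_hanging(path, recursive=None):
    """Load files (hypothetical function, fixed hanging indent).

    :param recursive: recursively scan a directory for files. Using this
        option disables partition discovery.
    """


def param_text(func):
    """Return the ':param' part of func's docstring with whitespace collapsed."""
    doc = inspect.getdoc(func)
    start = doc.index(":param")
    return " ".join(doc[start:].split())


if __name__ == "__main__":
    print(param_text(read_aligned) == param_text(read_hanging))
```

With vertical alignment, the continuation indent depends on the length of the parameter name, so renaming a parameter means re-indenting its continuation lines; the fixed hanging indent does not have that coupling.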
   
   

