[GitHub] spark pull request: Add a Note on jsonFile having separate JSON ob...
Github user petervandenabeele commented on the pull request: https://github.com/apache/spark/pull/3517#issuecomment-67032682

I committed a revert that limits the squashed diff to a small addition of a Note in each of the 3 tabs (Scala, Java and Python). If anything more needs to happen, I am glad to look into it. I assume no rebase is required? I could do it in a separate PR if useful.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Add a Note on jsonFile having separate JSON ob...
Github user petervandenabeele commented on the pull request: https://github.com/apache/spark/pull/3517#issuecomment-66922392

Bump ... I suggest we revert to something close to my original proposal:

* no change in filenames (too complex for now)
* add a small(er) note in the doc about the non-standard format

In our DataScienceBe project, I just got this message from a new Spark user: to reiterate (and make sure I understand correctly), the `jsonFile` function does not read valid JSON files, but rather special files containing a valid JSON object on each line. Just making this clear to the users will already avoid some frustration.

Could you please confirm that I can go ahead with this proposal (or suggest a different path to resolve this)?
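To illustrate the format described in the comment above, here is a minimal sketch in plain Python (standard library only, no Spark involved). The helper name `parse_json_lines` and the sample records are hypothetical and only serve the illustration; the sketch mimics "one JSON object per line" parsing and is not a description of how `jsonFile` is implemented.

```python
import json


def parse_json_lines(text):
    """Parse text holding one JSON value per line; blank lines are skipped."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # A line that is only a fragment of a larger document ends up here.
            raise ValueError(
                f"line {line_number} is not a complete JSON value: {exc}"
            ) from exc
    return records


# One object per line: the layout the comment describes.
line_delimited = '{"name": "Michael"}\n{"name": "Andy", "age": 30}\n'

# A valid JSON document, but pretty-printed over several lines.
pretty_printed = json.dumps(
    [{"name": "Michael"}, {"name": "Andy", "age": 30}], indent=2
)

print(parse_json_lines(line_delimited))
try:
    parse_json_lines(pretty_printed)
except ValueError as err:
    print("rejected:", err)
```

The second input is valid JSON as a whole, yet its first line is just `[`, so a line-by-line reader rejects it. That is the kind of surprise the proposed note is meant to prevent.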
[GitHub] spark pull request: Add a Note on jsonFile having separate JSON ob...
Github user petervandenabeele commented on the pull request: https://github.com/apache/spark/pull/3517#issuecomment-65846515

More problematic (and sorry I had not seen that before) ... there already _is_ an example file named `people.txt` with a different format:

```
$ spark git:(pv-docs-note-on-jsonFile-format/01) cat examples/src/main/resources/people.txt
Michael, 29
Andy, 30
Justin, 19
```

In that case, I could rename the example jsonFile to `people.jsons`. It is a weird name, but it's _reasonably_ accurate (following the `xs` pattern from Scala, as it is like a list of json objects). I would then indeed also need to change the name in all other locations where a reference to `people.json` is made (confirming the list mentioned by @marmbrus):

```
spark git:(pv-docs-note-on-jsonFile-format/01) grep -r 'people\.json' * | grep -v Binary | grep -v _site
examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQL.java: String path = "examples/src/main/resources/people.json";
examples/src/main/python/sql.py:path = os.path.join(os.environ['SPARK_HOME'], "examples/src/main/resources/people.json")
```

On a more fundamental note, from the outside, I would have found it more in line with the principle of least astonishment (POLA) if the input to this function had to be a standard, valid json file formatted as an array of hashes with identical schema, e.g.

```
[ {"name": "Tom", "character": "cat"}, {"name": "Jerry", "character": "mouse"} ]
```

This would have allowed us to simply import data generated from any other language with `array.to_json`.

I hear the proposal from @marmbrus to also improve the error message (that would also have helped us understand the issue more quickly), but I would suggest putting that in a different JIRA issue (it needs some real programming and testing work).

I look forward to directions on how best to fix at least the documentation to avoid this confusion for others. Thanks.
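As a possible workaround for data that already exists as an "array of hashes", the sketch below converts such a document into one JSON object per line using only the Python standard library. The function name `array_to_json_lines` is hypothetical, and the sample input is the Tom and Jerry array from the comment; this is an illustration under those assumptions, not part of the pull request.

```python
import json


def array_to_json_lines(array_text):
    """Convert a JSON array of objects into one compact JSON object per line."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without `indent` never emits a raw newline, so each record
    # stays on exactly one line.
    return "\n".join(json.dumps(record) for record in records)


standard_json = (
    '[ {"name": "Tom", "character": "cat"}, '
    '{"name": "Jerry", "character": "mouse"} ]'
)

print(array_to_json_lines(standard_json))
```

The printed text has one object per line, which is the layout the documentation note describes. Writing it to a file is left out to keep the sketch short.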
[GitHub] spark pull request: Add a Note on jsonFile having separate JSON ob...
Github user petervandenabeele commented on the pull request: https://github.com/apache/spark/pull/3517#issuecomment-65292817

Thanks @JoshRosen for your follow-up. I locally verified a squashed version of my 2 commits. The squashed change is now very limited, affecting 6 lines by replacing `(JSON)|(json)` with `txt`. I hope it avoids the confusion I faced when trying to feed a genuine json file to `sqlContext.jsonFile(path)`.
[GitHub] spark pull request: Add a Note on jsonFile having separate JSON ob...
Github user petervandenabeele commented on the pull request: https://github.com/apache/spark/pull/3517#issuecomment-65101060

@JoshRosen Good idea. Interestingly, the existing text already says a bit lower:

```
// The path can be either a single text file or a directory storing text files.
val path = "examples/src/main/resources/people.json"
```

I would suggest then also renaming the example file, so that this becomes

```
val path = "examples/src/main/resources/people.txt"
```

to make clear it is _not_ really a .json file. I will think about it and may submit a next version of the patch (which will then result in a smaller diff).

Would it not be better to start a new branch (pv-docs-note-on-jsonFile-format/02) that I rebase off current master and that only has the actual change (and not the initial change that was too verbose)?
[GitHub] spark pull request: Add a Note on jsonFile having separate JSON ob...
GitHub user petervandenabeele opened a pull request: https://github.com/apache/spark/pull/3517

Add a Note on jsonFile having separate JSON objects per line

* This commit hopes to avoid the confusion I faced when trying to submit a regular, valid multi-line JSON file, also see http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/petervandenabeele/spark pv-docs-note-on-jsonFile-format/01

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3517.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #3517

commit fca7dfbf893af06065f719aa0c5cf6a99d3aad37
Author: Peter Vandenabeele pe...@vandenabeele.com
Date: 2014-11-30T16:47:58Z

Add a Note on jsonFile having separate JSON objects per line

* This commit hopes to avoid the confusion I faced when trying to submit a regular, valid multi-line JSON file, also see http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html