[GitHub] spark pull request: Add a Note on jsonFile having separate JSON ob...

2014-12-15 Thread petervandenabeele
Github user petervandenabeele commented on the pull request:

https://github.com/apache/spark/pull/3517#issuecomment-67032682
  
I committed a revert that limits the squashed diff to a small addition of a 
Note for the 3 tabs of Scala, Java and Python.

If anything more needs to happen, glad to look into it.

No rebase is required, correct? I could do it in a separate PR if useful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Add a Note on jsonFile having separate JSON ob...

2014-12-14 Thread petervandenabeele
Github user petervandenabeele commented on the pull request:

https://github.com/apache/spark/pull/3517#issuecomment-66922392
  
Bump ...

I suggest we revert to something close to my original proposal:
* no change in filenames (too complex for now)
* add a small(er) note in the doc about the non-standard format

In our DataScienceBe project, I just got this message from a new Spark user:

to reiterate (and make sure I understand correctly), the 
`jsonFile` function does not read valid JSON files, but rather special files 
containing a valid JSON object on each line.

Just making this clear to the users will already avoid some frustration.
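
To make the distinction concrete, here is a minimal sketch (plain Python with the standard `json` module, not Spark code) of a check for the one-object-per-line layout. The helper name `is_json_lines` and the sample strings are made up for illustration; the check only mimics the expectation described in the quote above and says nothing about how `jsonFile` is implemented.

```python
import json


def is_json_lines(text):
    """Return True if every non-empty line of `text` is a JSON object on its own."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    for line in lines:
        try:
            value = json.loads(line)
        except ValueError:
            # A pretty-printed document fails here: a lone "[" or a
            # fragment such as '{"name": "Tom"},' is not valid JSON.
            return False
        if not isinstance(value, dict):
            return False
    return True


# Hypothetical sample data: one object per line vs. a regular multi-line array.
line_delimited = '{"name": "Michael"}\n{"name": "Andy", "age": 30}\n'
pretty_printed = '[\n  {"name": "Tom"},\n  {"name": "Jerry"}\n]\n'
```

With these samples, `is_json_lines(line_delimited)` is true and `is_json_lines(pretty_printed)` is false, even though both strings hold valid JSON data.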

Could you please confirm that I can go ahead with this proposal (or suggest 
a different path to resolve this)?





[GitHub] spark pull request: Add a Note on jsonFile having separate JSON ob...

2014-12-05 Thread petervandenabeele
Github user petervandenabeele commented on the pull request:

https://github.com/apache/spark/pull/3517#issuecomment-65846515
  
More problematic (and sorry I had not seen that before) ... there already 
_is_ an example file named `people.txt` with a different format:

```
$ spark git:(pv-docs-note-on-jsonFile-format/01) cat examples/src/main/resources/people.txt
Michael, 29
Andy, 30
Justin, 19
```

In that case, I could rename the example jsonFile to `people.jsons`. It is 
a weird name, but it's _reasonably_ accurate (following the `xs` pattern from 
Scala, as it is like a list of json objects).

I would then indeed also need to change the name in all other locations 
where a reference to `people.json` is made (confirming the list mentioned by 
@marmbrus): 

```
spark git:(pv-docs-note-on-jsonFile-format/01) grep -r 'people\.json' * | grep -v Binary | grep -v _site
examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQL.java:String path = "examples/src/main/resources/people.json";
examples/src/main/python/sql.py:path = os.path.join(os.environ['SPARK_HOME'], "examples/src/main/resources/people.json")
```

On a more fundamental note, from the outside, I would have found it more in 
line with the principle of least astonishment (POLA) if the input to this 
function were a standard, valid JSON file formatted as an array of hashes 
with identical schema, e.g.

```
[
  {"name": "Tom",
   "character": "cat"},
  {"name": "Jerry",
   "character": "mouse"}
]
```
This would have allowed us to simply import data generated from any other 
language with `array.to_json`. 
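
As a possible workaround, a top-level JSON array can be flattened into the one-object-per-line layout before loading. The sketch below uses only Python's standard `json` module; `array_to_json_lines` is a hypothetical helper name, not part of Spark, and it assumes the whole array fits in memory.

```python
import json


def array_to_json_lines(array_text):
    """Convert the text of a JSON array into one compact JSON value per line."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without an indent never emits newlines, so each
    # record is guaranteed to occupy exactly one line of the output.
    return "".join(json.dumps(record) + "\n" for record in records)


# Hypothetical input, shaped like the output of `array.to_json`.
sample = """
[
  {"name": "Tom",
   "character": "cat"},
  {"name": "Jerry",
   "character": "mouse"}
]
"""
converted = array_to_json_lines(sample)
```

Writing `converted` to a file gives two lines, one per record, which is the layout the current documentation example uses.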

I hear the proposal from @marmbrus to also improve the error message (that 
would also have helped us understand the issue more quickly), but I would 
suggest putting that in a separate JIRA issue (it needs some real 
programming and testing work).

I look forward to directions on how to best fix at least the documentation 
to avoid this confusion for others.

Thanks.





[GitHub] spark pull request: Add a Note on jsonFile having separate JSON ob...

2014-12-02 Thread petervandenabeele
Github user petervandenabeele commented on the pull request:

https://github.com/apache/spark/pull/3517#issuecomment-65292817
  
Thx @JoshRosen for your follow-up.

I locally verified a squashed version of my 2 commits. The squashed change 
is now very limited, affecting 6 lines by replacing `(JSON)|(json)` with 
`txt`.

I hope it avoids the confusion I faced in trying to feed a genuine json 
file to `sqlContext.jsonFile(path)`.





[GitHub] spark pull request: Add a Note on jsonFile having separate JSON ob...

2014-12-01 Thread petervandenabeele
Github user petervandenabeele commented on the pull request:

https://github.com/apache/spark/pull/3517#issuecomment-65101060
  
@JoshRosen Good idea. Interestingly, the existing text already says a bit 
lower:

```
// The path can be either a single text file or a directory storing text files.
val path = "examples/src/main/resources/people.json"
```

I would then suggest also renaming the example file to

```
val path = "examples/src/main/resources/people.txt"
```
to make clear it is _not_ really a .json file.
I will think about it and may submit a next version of the patch
(which will result in a smaller diff then).

Would it not be better to start a new branch 
(pv-docs-note-on-jsonFile-format/02)
that I rebase off current master and that only has the actual change (and 
not the initial change that was too verbose)?





[GitHub] spark pull request: Add a Note on jsonFile having separate JSON ob...

2014-11-30 Thread petervandenabeele
GitHub user petervandenabeele opened a pull request:

https://github.com/apache/spark/pull/3517

Add a Note on jsonFile having separate JSON objects per line

* This commit hopes to avoid the confusion I faced when trying
  to submit a regular, valid multi-line JSON file, also see

  
http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/petervandenabeele/spark pv-docs-note-on-jsonFile-format/01

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3517.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3517


commit fca7dfbf893af06065f719aa0c5cf6a99d3aad37
Author: Peter Vandenabeele pe...@vandenabeele.com
Date:   2014-11-30T16:47:58Z

Add a Note on jsonFile having separate JSON objects per line

* This commit hopes to avoid the confusion I faced when trying
  to submit a regular, valid multi-line JSON file, also see

  
http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html



