Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/455#issuecomment-45415893

So after looking at this more, I think it's great to add support for the SequenceFile format in Python; it will be super useful. This patch takes things a step further, though, and also introduces an API for users to plug in different types of Hadoop input formats. I'm a bit concerned about exposing that to users (which this patch does via the `Converter` stuff), because I'm sure people will start building on it, and it's not an interface I'd be happy supporting going forward. For one thing, it only applies to Python. For another, there are other outstanding community proposals for how to deal with things like HBase support (see #194).

In the near future (possibly 1.1) we're planning to standardize the way this works via SchemaRDD, and that will be automatically supported in Python. Basically, we'll have ways to read from and write to a SchemaRDD across several storage systems. So I just don't think it's a good idea to introduce a different pluggable mechanism here and encourage users to integrate at this level, and I wonder if we should just merge a version of this patch that only exposes sequenceFile and not the more general converter mechanism.

@mateiz @marmbrus curious to hear your thoughts as well
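For readers outside the review thread: the "pluggable converter" question is about letting users register code that maps Hadoop Writable key/value types to Python objects when reading a SequenceFile. The sketch below is purely illustrative — the class and function names are hypothetical stand-ins, not the patch's actual interface or PySpark's API — but it shows the shape of the extension point the comment is worried about committing to.

```python
# Hypothetical sketch of a pluggable Converter mechanism; all names here
# are illustrative, not the actual interface proposed in the patch.

class Converter:
    """Base interface: turn a raw Hadoop-side value into a Python object."""
    def convert(self, obj):
        raise NotImplementedError

class TextConverter(Converter):
    # e.g. org.apache.hadoop.io.Text -> str
    def convert(self, obj):
        return str(obj)

class IntWritableConverter(Converter):
    # e.g. org.apache.hadoop.io.IntWritable -> int
    def convert(self, obj):
        return int(obj)

def read_sequence_file(records, key_converter, value_converter):
    """Stand-in for a sequenceFile reader: apply user-supplied converters
    to each raw (key, value) pair coming from the JVM side."""
    return [(key_converter.convert(k), value_converter.convert(v))
            for k, v in records]

# Raw pairs as they might arrive from Hadoop (plain strings as stand-ins).
raw = [("alpha", "1"), ("beta", "2")]
pairs = read_sequence_file(raw, TextConverter(), IntWritableConverter())
print(pairs)  # [('alpha', 1), ('beta', 2)]
```

Once users subclass an interface like this, it becomes a public contract that has to be supported across releases — which is the maintenance concern the comment raises.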