[GitHub] orc pull request #41: ORC-58: Move code for reading rows from Reader to RowR...

2017-01-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/orc/pull/41


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #41: ORC-58: Move code for reading rows from Reader to RowR...

2016-07-19 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/41#discussion_r71443174
  
--- Diff: c++/src/Reader.cc ---
@@ -1062,37 +1062,36 @@ namespace orc {
 // PASS
   }
 
+  RowReader::~RowReader() {
+// PASS
+  }
+
   static const uint64_t DIRECTORY_SIZE_GUESS = 16 * 1024;
 
-  class ReaderImpl : public Reader {
+  class ReaderImpl;
+
+  class RowReaderImpl : public RowReader {
   private:
 const Timezone& localTimezone;
 
 // inputs
-std::unique_ptr stream;
-ReaderOptions options;
-const uint64_t fileLength;
-const uint64_t postscriptLength;
 std::vector selectedColumns;
-
+std::shared_ptr stream;
--- End diff --

I'd suggest creating a single class that has all of the state for the 
ReaderImpl. Then it can have the postscript, InputStream, etc. and be shared 
between the ReaderImpl and the set of RowReaderImpls that share the same 
ReaderImpl.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #41: ORC-58: Move code for reading rows from Reader to RowR...

2016-07-19 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/41#discussion_r71379015
  
--- Diff: c++/include/orc/Reader.hh ---
@@ -782,85 +847,42 @@ namespace orc {
 getColumnStatistics(uint32_t columnId) const = 0;
 
 /**
- * Get the type of the rows in the file. The top level is typically a
- * struct.
- * @return the root type
- */
-virtual const Type& getType() const = 0;
-
-/**
- * Get the selected type of the rows in the file. The file's row type
- * is projected down to just the selected columns. Thus, if the file's
- * type is struct and the selected
- * columns are "col0,col2" the selected type would be
- * struct.
- * @return the root type
- */
-virtual const Type& getSelectedType() const = 0;
-
-/**
- * Get the selected columns of the file.
+ * check file has correct column statistics
  */
-virtual const std::vector getSelectedColumns() const = 0;
+virtual bool hasCorrectStatistics() const = 0;
 
 /**
- * Create a row batch for reading the selected columns of this file.
- * @param size the number of rows to read
- * @return a new ColumnVectorBatch to read into
+ * Get the serialized file tail.
+ * Usefull if another reader of the same file wants to avoid re-reading
+ * the file tail. See ReaderOptions.setSerializedFileTail().
+ * @return a string of bytes with the file tail
  */
-virtual ORC_UNIQUE_PTR createRowBatch(uint64_t size
- ) const = 0;
+virtual std::string getSerializedFileTail() const = 0;
 
 /**
- * Read the next row batch from the current position.
- * Caller must look at numElements in the row batch to determine how
- * many rows were read.
- * @param data the row batch to read into.
- * @return true if a non-zero number of rows were read or false if the
- *   end of the file was reached.
+ * Get the type of the rows in the file. The top level is typically a
+ * struct.
+ * @return the root type
  */
-virtual bool next(ColumnVectorBatch& data) = 0;
+virtual const Type& getType() const = 0;
 
 /**
- * Get the row number of the first row in the previously read batch.
- * @return the row number of the previous batch.
+ * @return a RowReader to read the rows
  */
-virtual uint64_t getRowNumber() const = 0;
+virtual ORC_UNIQUE_PTR getRowReader() const = 0;
 
 /**
- * Seek to a given row.
- * @param rowNumber the next row the reader should return
+ * @param include update with new columns
+ * @return a RowReader to read the rows
  */
-virtual void seekToRow(uint64_t rowNumber) = 0;
+virtual ORC_UNIQUE_PTR
+getRowReader(const std::list& include) const = 0;
--- End diff --

Let's go ahead and make a RowReaderOptions class and pass that in here. It 
is almost guaranteed that the include vector will not be the only option that 
we want to pass in. One of the options will be to specify the include vector.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #41: ORC-58: Move code for reading rows from Reader to RowR...

2016-07-19 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/41#discussion_r71378147
  
--- Diff: c++/include/orc/Reader.hh ---
@@ -645,7 +645,72 @@ namespace orc {
   };
 
   /**
-   * The interface for reading ORC files.
+   * The interface for reading rows in ORC files.
+   * This is an an abstract class that will subclassed as necessary.
+   */
+  class RowReader {
+  public:
+virtual ~RowReader();
+/**
+ * Get the selected type of the rows in the file. The file's row type
+ * is projected down to just the selected columns. Thus, if the file's
+ * type is struct and the selected
+ * columns are "col0,col2" the selected type would be
+ * struct.
+ * @return the root type
+ */
+virtual const Type& getSelectedType() const = 0;
+
+/**
+ * Get the selected columns of the file.
+ */
+virtual const std::vector getSelectedColumns() const = 0;
+
+/**
+ * Create a row batch for reading the selected columns of this file.
+ * @param size the number of rows to read
+ * @return a new ColumnVectorBatch to read into
+ */
+virtual ORC_UNIQUE_PTR createRowBatch(uint64_t size
+ ) const = 0;
+
+/**
+ * Read the next row batch from the current position.
+ * Caller must look at numElements in the row batch to determine how
+ * many rows were read.
+ * @param data the row batch to read into.
+ * @return true if a non-zero number of rows were read or false if the
+ *   end of the file was reached.
+ */
+virtual bool next(ColumnVectorBatch& data) = 0;
+
+/**
+ * Get the row number of the first row in the previously read batch.
+ * @return the row number of the previous batch.
+ */
+virtual uint64_t getRowNumber() const = 0;
+
+/**
+ * Seek to a given row.
+ * @param rowNumber the next row the reader should return
+ */
+virtual void seekToRow(uint64_t rowNumber) = 0;
+  
+/**
+ * Estimate an upper bound on heap memory allocation by the Reader
+ * based on the information in the file footer.
+ * The bound is less tight if only few columns are read or compression 
is
+ * used.
+ * @param stripeIx index of the stripe to be read (if not specified,
+ *all stripes are considered).
+ * @return upper bound on memory use
+ */
+virtual uint64_t getMemoryUse(int stripeIx=-1) = 0;
--- End diff --

This should stay in the reader.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #41: ORC-58: Move code for reading rows from Reader to RowR...

2016-07-19 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/41#discussion_r71377966
  
--- Diff: c++/include/orc/Reader.hh ---
@@ -645,7 +645,72 @@ namespace orc {
   };
 
   /**
-   * The interface for reading ORC files.
+   * The interface for reading rows in ORC files.
+   * This is an an abstract class that will subclassed as necessary.
+   */
+  class RowReader {
+  public:
+virtual ~RowReader();
+/**
+ * Get the selected type of the rows in the file. The file's row type
+ * is projected down to just the selected columns. Thus, if the file's
+ * type is struct and the selected
+ * columns are "col0,col2" the selected type would be
+ * struct.
+ * @return the root type
+ */
+virtual const Type& getSelectedType() const = 0;
+
+/**
+ * Get the selected columns of the file.
+ */
+virtual const std::vector getSelectedColumns() const = 0;
+
--- End diff --

You have some trailing spaces. You can probably configure your IDE to 
automatically trim them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #41: ORC-58: Move code for reading rows from Reader to RowR...

2016-07-15 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/41#discussion_r71029025
  
--- Diff: c++/src/Reader.cc ---
@@ -1108,13 +1107,75 @@ namespace orc {
 // internal methods
 proto::StripeFooter getStripeFooter(const proto::StripeInformation& 
info);
 void startNextStripe();
-void checkOrcVersion();
 void selectType(const Type& type);
-void readMetadata() const;
 void updateSelected(const std::list& fieldIds);
 void updateSelected(const std::list& fieldNames);
 
   public:
+   /**
+* Constructor that lets the user specify additional options.
+* @param filereader the object to read from
+* @param options options for reading
+*/
+RowReaderImpl(const ReaderImpl* filereader,
--- End diff --

We will have to extend the `Reader` interface with a notion of `metadata`, 
`ReaderOptions` etc. to achieve this. I am not sure what the right design 
choice is here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #41: ORC-58: Move code for reading rows from Reader to RowR...

2016-07-15 Thread swalkaus
Github user swalkaus commented on a diff in the pull request:

https://github.com/apache/orc/pull/41#discussion_r71004575
  
--- Diff: c++/src/Reader.cc ---
@@ -1108,13 +1107,75 @@ namespace orc {
 // internal methods
 proto::StripeFooter getStripeFooter(const proto::StripeInformation& 
info);
 void startNextStripe();
-void checkOrcVersion();
 void selectType(const Type& type);
-void readMetadata() const;
 void updateSelected(const std::list& fieldIds);
 void updateSelected(const std::list& fieldNames);
 
   public:
+   /**
+* Constructor that lets the user specify additional options.
+* @param filereader the object to read from
+* @param options options for reading
+*/
+RowReaderImpl(const ReaderImpl* filereader,
--- End diff --

Should filereader be a "Reader" or "ReaderImpl"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #41: ORC-58: Move code for reading rows from Reader to RowR...

2016-07-15 Thread swalkaus
Github user swalkaus commented on a diff in the pull request:

https://github.com/apache/orc/pull/41#discussion_r71004296
  
--- Diff: c++/src/Reader.cc ---
@@ -1108,13 +1107,75 @@ namespace orc {
 // internal methods
 proto::StripeFooter getStripeFooter(const proto::StripeInformation& 
info);
 void startNextStripe();
-void checkOrcVersion();
 void selectType(const Type& type);
-void readMetadata() const;
 void updateSelected(const std::list& fieldIds);
 void updateSelected(const std::list& fieldNames);
 
   public:
+   /**
+* Constructor that lets the user specify additional options.
+* @param filereader the object to read from
+* @param options options for reading
+*/
+RowReaderImpl(const ReaderImpl* filereader,
+   std::shared_ptr options);
+
+uint64_t getMemoryUse(int stripeIx = -1) override;
+
+const std::vector getSelectedColumns() const override;
+
+const Type& getSelectedType() const override;
+
+std::unique_ptr createRowBatch(uint64_t size
+  ) const override;
+
+bool next(ColumnVectorBatch& data) override;
+
+const ReaderOptions& getReaderOptions() const;
+
+CompressionKind getCompression() const;
+
+uint64_t getCompressionSize() const;
+
+uint64_t getRowNumber() const override;
+
+void seekToRow(uint64_t rowNumber) override;
+
+MemoryPool* getMemoryPool() const ;
+
+  };
+
+  class ReaderImpl : public Reader {
+friend class RowReaderImpl;
--- End diff --

Avoid using friend.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #41: ORC-58: Move code for reading rows from Reader to RowR...

2016-07-15 Thread swalkaus
Github user swalkaus commented on a diff in the pull request:

https://github.com/apache/orc/pull/41#discussion_r70968789
  
--- Diff: c++/include/orc/Reader.hh ---
@@ -645,7 +645,72 @@ namespace orc {
   };
 
   /**
-   * The interface for reading ORC files.
+   * The interface for reading rows in ORC files.
+   * This is an an abstract class that will subclassed as necessary.
+   */
+  class RowReader {
+  public:
+virtual ~RowReader();
+/**
+ * Get the selected type of the rows in the file. The file's row type
+ * is projected down to just the selected columns. Thus, if the file's
+ * type is struct and the selected
+ * columns are "col0,col2" the selected type would be
+ * struct.
+ * @return the root type
+ */
+virtual const Type& getSelectedType() const = 0;
+
+/**
+ * Get the selected columns of the file.
+ */
+virtual const std::vector getSelectedColumns() const = 0;
+
+/**
+ * Create a row batch for reading the selected columns of this file.
+ * @param size the number of rows to read
+ * @return a new ColumnVectorBatch to read into
+ */
+virtual ORC_UNIQUE_PTR createRowBatch(uint64_t size
+ ) const = 0;
+
+/**
+ * Read the next row batch from the current position.
+ * Caller must look at numElements in the row batch to determine how
+ * many rows were read.
+ * @param data the row batch to read into.
+ * @return true if a non-zero number of rows were read or false if the
+ *   end of the file was reached.
+ */
+virtual bool next(ColumnVectorBatch& data) = 0;
+
+/**
+ * Get the row number of the first row in the previously read batch.
+ * @return the row number of the previous batch.
+ */
+virtual uint64_t getRowNumber() const = 0;
+
+/**
+ * Seek to a given row.
+ * @param rowNumber the next row the reader should return
+ */
+virtual void seekToRow(uint64_t rowNumber) = 0;
+  
+/**
+ * Estimate an upper bound on heap memory allocation by the Reader
+ * based on the information in the file footer.
+ * The bound is less tight if only few columns are read or compression 
is
+ * used.
+ * @param stripeIx index of the stripe to be read (if not specified,
+ *all stripes are considered).
+ * @return upper bound on memory use
+ */
+virtual uint64_t getMemoryUse(int stripeIx=-1) = 0;
+ 
+  };
+
+  /**
+   * The interface for reading footer in ORC files.
--- End diff --

"The interface for reading ORC file meta-data."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #41: ORC-58: Move code for reading rows from Reader to RowR...

2016-07-15 Thread swalkaus
Github user swalkaus commented on a diff in the pull request:

https://github.com/apache/orc/pull/41#discussion_r70967406
  
--- Diff: tools/src/FileContents.cc ---
@@ -27,8 +27,10 @@
 #include 
 
 void printContents(const char* filename, const orc::ReaderOptions opts) {
-  std::unique_ptr reader;
-  reader = orc::createReader(orc::readLocalFile(std::string(filename)), 
opts);
+  std::unique_ptr fileReader;
--- End diff --

Pendantic: use reader and rowReader consistently (notice the camelCase).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #41: ORC-58: Move code for reading rows from Reader to RowR...

2016-06-19 Thread majetideepak
GitHub user majetideepak opened a pull request:

https://github.com/apache/orc/pull/41

ORC-58: Move code for reading rows from Reader to RowReader

This PR divides the current Reader class into Reader class and RowReader 
class
Reader class reads the footer and metadata from an ORC file.
RowReader class reads the rows from the file.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majetideepak/orc RowReader

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/41.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #41


commit ed02d7ab668c1a61f8260e0efddf06c617658753
Author: Deepak Majeti 
Date:   2016-06-05T21:51:45Z

Split curent Reader into RowReader and Reader




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---