[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2010-03-19 Thread Dmitriy V. Ryaboy (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Dmitriy V. Ryaboy updated PIG-1117:
---

Attachment: PIG-1117-0.7.0-reviewed.patch

Minor review changes, all superficial.

- changed the spacing to conform to project conventions
- added spaces before / after the curly braces where I saw them missing
- fixed spelling and occasional references to HiveRCLoader in the docs (you've
renamed it to HiveColumnarLoader)
- minor tweak to get rid of one remaining deprecation warning in the
RecordReader

Tests pass on my machine.

Gerrit, if you are ok with these changes, I will commit.

Pig reading hive columnar rc tables
---

Key: PIG-1117
URL: https://issues.apache.org/jira/browse/PIG-1117
Project: Pig
Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
Fix For: 0.7.0

Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch,
PIG-1117-0.7.0-new.patch, PIG-1117-0.7.0-reviewed.patch, PIG-1117.patch,
PIG-117-v.0.6.0.patch, PIG-117-v.0.7.0.patch

I've coded a LoadFunc implementation that can read from Hive Columnar RC
tables. This is needed for a project that I'm working on because all our data
is stored using the Hive thrift serialized Columnar RC format. I have looked
at the piggy bank but did not find any implementation that could do this.
We've been running it on our cluster for the last week and have worked out
most bugs.

There are still some improvements to be done that I would need, like setting
the number of mappers based on date partitioning. It's been optimized to
read only specific columns and can churn through a data set almost 8 times
faster with this improvement because not all column data is read.
I would like to contribute the class to the piggybank. Can you guide me in
what I need to do?
I've used Hive-specific classes to implement this. Is it possible to add this
to the piggy bank build ivy for automatic download of the dependencies?
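
To illustrate why reading only specific columns helps, here is a minimal sketch of column projection over a toy in-memory columnar layout. It is not the RCFile format and uses no Hive or Pig classes; the names ColumnProjection, project and cellsRead are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a toy in-memory columnar layout, not the RCFile format or Hive's API.
final class ColumnProjection {

    /** Number of cells touched by the most recent call to project(). */
    static int cellsRead = 0;

    /**
     * Builds rows from only the wanted columns.
     * columns[c][r] holds the value of column c in row r.
     */
    static List<String[]> project(String[][] columns, int[] wanted, int rowCount) {
        cellsRead = 0;
        List<String[]> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            String[] row = new String[wanted.length];
            for (int i = 0; i < wanted.length; i++) {
                row[i] = columns[wanted[i]][r]; // unwanted columns are never touched
                cellsRead++;
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String[][] columns = {
            {"a1", "a2"}, {"b1", "b2"}, {"c1", "c2"}, {"d1", "d2"}
        };
        // Ask for columns 0 and 2 only; columns 1 and 3 are skipped entirely.
        List<String[]> rows = project(columns, new int[] {0, 2}, 2);
        System.out.println(rows.get(1)[1] + ", cells read: " + cellsRead);
    }
}
```

Because each column is stored separately, the work done grows with the number of requested columns rather than with the width of the table. The speed-up seen on a real cluster depends on the data and on how many columns a script actually uses.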

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2010-03-19 Thread Dmitriy V. Ryaboy (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Dmitriy V. Ryaboy updated PIG-1117:
---

Attachment: PIG-1117-0.7.0-reviewed.patch

Attaching again -- forgot to click the license check box.
Which reminded me to check for Apache license headers in the new files, and
turns out they were missing -- so I added them. Assuming that's ok since Gerrit
granted license for the patches when he attached them to the Jira.

Pig reading hive columnar rc tables
---

Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch,
PIG-1117-0.7.0-new.patch, PIG-1117-0.7.0-reviewed.patch,
PIG-1117-0.7.0-reviewed.patch, PIG-1117.patch, PIG-117-v.0.6.0.patch,
PIG-117-v.0.7.0.patch


[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2010-03-19 Thread Dmitriy V. Ryaboy (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Dmitriy V. Ryaboy updated PIG-1117:
---

Resolution: Fixed
Status: Resolved (was: Patch Available)

Patch committed.
Thanks for this contribution, Gerrit! This will really help people who are
working with both Hive and Pig.

Now we just need a Zebra SerDe... :-)

Pig reading hive columnar rc tables
---


[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2010-03-17 Thread Gerrit Jansen van Vuuren (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Status: Open  (was: Patch Available)

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
 Fix For: 0.7.0

 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
 PIG-1117-0.7.0-new.patch, PIG-1117.patch, PIG-117-v.0.6.0.patch, 
 PIG-117-v.0.7.0.patch



[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2010-03-17 Thread Gerrit Jansen van Vuuren (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Attachment: PIG-1117-0.7.0-new.patch

HiveColumnarLoader with version 0.5.0 of Hive

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
 Fix For: 0.7.0

 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
 PIG-1117-0.7.0-new.patch, PIG-1117.patch, PIG-117-v.0.6.0.patch, 
 PIG-117-v.0.7.0.patch



[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-23 Thread Gerrit Jansen van Vuuren (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Affects Version/s: (was: 0.6.0)
   Status: Open  (was: Patch Available)

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
 Fix For: 0.7.0

 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
 PIG-1117.patch, PIG-117-v.0.6.0.patch



[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-23 Thread Gerrit Jansen van Vuuren (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Attachment: PIG-117-v.0.7.0.patch

Changes:
- Slicing done per block and not per file.
- Automatic download of hive dependencies from the apache website. This is
only done once.
- Added empty implementation for fieldsToRead (will implement this soon).
- Refactored out code duplication.
- Changed Byte value to be cast to Integer
- Changed Boolean values to be 1 if true else 0
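
The last two changes can be pictured with a small sketch. This is not the loader's actual code: HiveValueNormalizer and normalizeValue are hypothetical names, and the sketch only assumes that a column value arrives as a plain Java object.

```java
// Sketch only: normalizeValue is a hypothetical helper, not the loader's actual code.
final class HiveValueNormalizer {

    /** Widens Byte to Integer and maps Boolean to 1/0; other values pass through. */
    static Object normalizeValue(Object value) {
        if (value instanceof Byte) {
            return Integer.valueOf(((Byte) value).intValue());
        }
        if (value instanceof Boolean) {
            return Integer.valueOf(((Boolean) value) ? 1 : 0);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(normalizeValue((byte) 7));
        System.out.println(normalizeValue(Boolean.TRUE));
        System.out.println(normalizeValue("text"));
    }
}
```

Mapping both types to Integer means a script sees a single numeric type for these columns, at the cost of losing the distinction between a boolean and a small integer.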

Test: ant hive-test
Jar: ant hive-jar

Dependencies:
The hive_exec.jar needs to be either on the classpath of all task nodes or
registered in the Pig script, e.g.
REGISTER hive_exec.jar
REGISTER piggybank.jar

Pig reading hive columnar rc tables
---

Key: PIG-1117
URL: https://issues.apache.org/jira/browse/PIG-1117
Project: Pig
Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
Fix For: 0.7.0

Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch,
PIG-1117.patch, PIG-117-v.0.6.0.patch, PIG-117-v.0.7.0.patch


[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-23 Thread Gerrit Jansen van Vuuren (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

 Tags: PIG-117-v.0.7.0.patch  (was: PIG-117-v.0.6.0.patch)
Affects Version/s: 0.7.0
   Status: Patch Available  (was: Open)

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Gerrit Jansen van Vuuren
 Fix For: 0.7.0

 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
 PIG-1117.patch, PIG-117-v.0.6.0.patch, PIG-117-v.0.7.0.patch



[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-22 Thread Gerrit Jansen van Vuuren (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Attachment: PIG-117-v.0.6.0.patch

Patch for Pig version 0.6.0 (should work for previous versions, at least for
0.5.0).
Contains the following:
- Improved HiveRCLoader with a Slicer that does the slicing correctly based on
file blocks. The previous version just read the whole file and all its
associated blocks from one task.
- Refactored to make Byte and Boolean values Integer.
- Refactored to take out code duplication in the setup method of HiveRCLoader.
- build.xml automatically downloads the Hive jars from the Apache website (only
once, if the Hive deps haven't been downloaded already).
To build the piggybank jar with HiveRCLoader inside, use ant hive-jar.

To use it, the hive_exec.jar must be available to the Pig jobs: the piggybank
jar plus the hive_exec.jar must be either registered with the Pig script or
available on the class path.
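
The idea of slicing by file block rather than by whole file can be sketched as follows. This is an illustration under stated assumptions, not the patch's Slicer: BlockSlices and sliceByBlock are hypothetical names, the block size is passed in by the caller, and a real reader would also have to align each range to a record boundary in the file format, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: BlockSlices and sliceByBlock are hypothetical names, not Pig's Slicer API.
final class BlockSlices {

    /** Returns {offset, length} pairs, one per block, covering a file of fileLength bytes. */
    static List<long[]> sliceByBlock(long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<long[]> slices = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last slice may be shorter than a full block.
            long length = Math.min(blockSize, fileLength - offset);
            slices.add(new long[] {offset, length});
        }
        return slices;
    }

    public static void main(String[] args) {
        // A 300-byte file with 128-byte blocks yields three slices.
        for (long[] s : sliceByBlock(300L, 128L)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

With one slice per block, each task reads roughly one block's worth of data instead of a single task reading the whole file and all of its blocks.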

Pig reading hive columnar rc tables
---

Key: PIG-1117
URL: https://issues.apache.org/jira/browse/PIG-1117
Project: Pig
Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch,
PIG-1117.patch, PIG-117-v.0.6.0.patch


[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-22 Thread Gerrit Jansen van Vuuren (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Status: Open  (was: Patch Available)

A refactored version of this patch will follow shortly.

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
 PIG-1117.patch, PIG-117-v.0.6.0.patch



[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-22 Thread Gerrit Jansen van Vuuren (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

 Tags: PIG-117-v.0.6.0.patch  (was: PIG-1117.patch)
Fix Version/s: 0.6.0
Affects Version/s: 0.6.0
 Release Note:   (was: Contains:
-build.xml updated
-HiveColumnarLoader improved
-TesthiveColumnarLoader improved
-hive dependencies to compile.
-source code separated from other non hive dependent udfs
)
   Status: Patch Available  (was: Open)

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Gerrit Jansen van Vuuren
 Fix For: 0.6.0

 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
 PIG-1117.patch, PIG-117-v.0.6.0.patch



[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-22 Thread Olga Natkovich (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1117:


Fix Version/s: (was: 0.6.0)
   0.7.0

We already branched for the 0.6.0 release, so only blockers go into 0.6.0 as of now.
This feature will be part of the next release.

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Gerrit Jansen van Vuuren
 Fix For: 0.7.0

 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
 PIG-1117.patch, PIG-117-v.0.6.0.patch



[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-10 Thread Gerrit Jansen van Vuuren (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Attachment: PIG-1117.patch

This patch contains the following:

Improved HiveColumnarLoader:
  - Implements the Slicer interface so that the correct number of slices is
returned when date filtering is used.
  - Performance improvement in how columns are read.

TestHiveColumnarLoader
   - Better testing and improved cleanup

build.xml
- Updated the build.xml file with the following targets: hive-compile,
hive-javadoc, hive-jar, hive-test, hive-compile-test.
   These targets do not compile Hive; they compile the udfs that depend on
Hive classes, e.g. HiveColumnarLoader.

lib-hivedeps
 - This contains all of the Hive jars for the Hive-dependent udfs.
 - Currently the only Hive jar needed is hive-exec.jar

The Hive-dependent udf source and test source are separated from the rest of
the source code. The source directory structure is:
  src/main/java
  src/main/java-hiveudfs
  src/test/java
  src/test/java-hiveudfs

This allows all other udfs that only depend on Pig to compile without bothering
with the Hive-dependent udfs.

To include all of the udfs and the Hive-dependent udfs (in this case
HiveColumnarLoader) in the final jar, run ant hive-jar.

Please comment with ideas and on whether this is an acceptable approach for
compiling and testing this class.

Something I've noted while compiling against the newest trunk version of Pig is
that the signature of the fieldsToRead method in the LoadFunc interface has changed:
   From
public void fieldsToRead(Schema schema);
   To
public RequiredFieldResponse fieldsToRead(RequiredFieldList requiredFieldList) throws FrontendException;

So this source will only work with Pig versions from before this change.



 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
 PIG-1117.patch



[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-10 Thread Gerrit Jansen van Vuuren (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Tags: PIG-1117.patch  (was: Pig hive rc columnar reader)
Release Note: 
Contains:
-build.xml updated
-HiveColumnarLoader improved
-TesthiveColumnarLoader improved
-hive dependencies to compile.
-source code separated from other non hive dependent udfs


  was:This patch needs the hive_exec.jar from the 
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/ build

  Status: Patch Available  (was: Open)

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
 PIG-1117.patch



RE: [jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-07 Thread Gerrit van Vuuren

Hi,

I would like to extend the HiveColumnarRC Reader in such a way that it can tell
Pig to only use a certain group of files, i.e. I want to filter the files and
have Pig only use these for calculating the number of tasks to run. I'd
appreciate it if anybody can point me in the right direction.

Cheers,
 Gerrit

-Original Message-
From: Gerrit Jansen van Vuuren (JIRA) [mailto:j...@apache.org] 
Sent: 03 December 2009 16:03
To: pig-dev@hadoop.apache.org
Subject: [jira] Updated: (PIG-1117) Pig reading hive columnar rc tables


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Attachment: HiveColumnarLoaderTest.patch
HiveColumnarLoader.patch

Pig Storage Loader for reading from HiveColumnarRC Files

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch



[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-03 Thread Gerrit Jansen van Vuuren (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Release Note: This patch needs the hive_exec.jar from the 
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/ build
  Status: Patch Available  (was: Open)

This is a first release of the code just to get it out and have people's
opinions on it. It comes with a very basic unit test which borrows some
RCFile.Writer code directly from some of the Hive RC File Tests.



 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren


[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-03 Thread Gerrit Jansen van Vuuren (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Status: Open  (was: Patch Available)

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch



[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-03 Thread Gerrit Jansen van Vuuren (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Attachment: HiveColumnarLoaderTest.patch
HiveColumnarLoader.patch

Pig Storage Loader for reading from HiveColumnarRC Files

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch



RE: [jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-03 Thread Gerrit van Vuuren

Hi,

I've made 2 patches: one for the Loader and another for the Unit Test.
It's not perfect yet, but at least this way people can start testing it and give
some input.

How do I submit the patch? I tried the SubmitPatch link but could not attach
the actual patch, then just ended up attaching it as a file.

Note that to run this you'll need to have the hive_exec.jar from Hive:
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/

Any help on how to integrate this with the ant build.xml will be appreciated.

Cheers,
 Gerrit

RE: [jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-03 Thread Olga Natkovich

You need to attach the file first and then submit the patch.

Olga

-Original Message-
From: Gerrit van Vuuren [mailto:gvanvuu...@specificmedia.com] 
Sent: Thursday, December 03, 2009 8:13 AM
To: pig-dev@hadoop.apache.org
Subject: RE: [jira] Updated: (PIG-1117) Pig reading hive columnar rc
tables

Hi,

I've made 2 patches: one for the Loader and another for the Unit Test.
It's not perfect yet, but at least this way people can start testing it
and give some input.

How do I submit the patch? I tried the SubmitPatch link but could not
attach the actual patch, then just ended up attaching it as a file.

Note that to run this you'll need to have the hive_exec.jar from Hive:
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/

Any help on how to integrate this with the ant build.xml will be
appreciated.

Cheers,
 Gerrit
