[jira] Updated: (PIG-1015) [piggybank] DateExtractor should take into account timezones
[ https://issues.apache.org/jira/browse/PIG-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1015: --- Fix Version/s: 0.6.0 Status: Patch Available (was: Open) [piggybank] DateExtractor should take into account timezones Key: PIG-1015 URL: https://issues.apache.org/jira/browse/PIG-1015 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.6.0 Attachments: date_extractor.patch The current implementation defaults to the local timezone when parsing strings, thereby providing inconsistent results depending on the settings of the computer the program is executing on (this is causing unit test failures). We should set the timezone to a consistent default, and allow users to override this default. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
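The fix described above -- a consistent default timezone with a user override -- can be sketched as follows. This is an illustrative sketch only, not the actual piggybank DateExtractor code; the class name, the UTC default, and the constructor shape are all assumptions made for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Illustrative sketch only -- not the actual piggybank DateExtractor.
// Parsing is pinned to a fixed zone (UTC here, an assumed default) so
// results do not depend on the machine's local timezone settings;
// callers can override the default explicitly.
public class ZonedDateParser {
    private final SimpleDateFormat format;

    public ZonedDateParser(String pattern) {
        this(pattern, TimeZone.getTimeZone("UTC")); // consistent default
    }

    public ZonedDateParser(String pattern, TimeZone tz) {
        this.format = new SimpleDateFormat(pattern);
        this.format.setTimeZone(tz);                // user-supplied override
    }

    public long parseMillis(String text) {
        try {
            return format.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("unparseable date: " + text, e);
        }
    }
}
```

With the zone pinned, the same input string yields the same epoch millis on every machine, which is exactly the property the failing unit tests rely on.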
[jira] Commented: (PIG-868) indexof / lastindexof / lower / replace / substring udf's
[ https://issues.apache.org/jira/browse/PIG-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764533#action_12764533 ] Dmitriy V. Ryaboy commented on PIG-868: --- The dateExtractor issue is addressed by PIG-1015 ; just changing the testcase is not sufficient, as the testcase will still break in some parts of the world because it relies on local settings. indexof / lastindexof / lower / replace / substring udf's - Key: PIG-868 URL: https://issues.apache.org/jira/browse/PIG-868 Project: Pig Issue Type: New Feature Reporter: Bennie Schut Priority: Trivial Attachments: addSomeUDFsPatch.patch, dateExtractorPatch.patch We parse some apache logs using pig and are using some pretty simple udf's like this: B = FOREACH A GENERATE substring(uri, lastindexof(uri, '/')+1, indexof(uri, '.txt')) as lang; It's pretty simple stuff but I figured someone else might find it useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
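The Pig expression above reduces to standard String operations; a plain-Java rendering of the same logic (purely illustrative -- the actual patch wraps these String methods in Pig EvalFunc classes, and the class and method names here are invented for the example) is:

```java
// Plain-Java rendering of the Pig expression
//   substring(uri, lastindexof(uri, '/')+1, indexof(uri, '.txt'))
// purely to illustrate what the proposed UDFs compute. The class and
// method names are hypothetical, not part of the patch.
public class UriLangExtractor {
    public static String extractLang(String uri) {
        int start = uri.lastIndexOf('/') + 1;
        int end = uri.indexOf(".txt");
        if (end < start) {
            return null; // no ".txt" suffix after the last slash
        }
        return uri.substring(start, end);
    }
}
```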
[jira] Created: (PIG-990) Provide a way to pin LogicalOperator Options
Provide a way to pin LogicalOperator Options Key: PIG-990 URL: https://issues.apache.org/jira/browse/PIG-990 Project: Pig Issue Type: Bug Components: impl Reporter: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.6.0 This is a proactive patch, setting up the groundwork for adding an optimizer. Some of the LogicalOperators have options. For example, LOJoin has a variety of join types (regular, fr, skewed, merge), which can be set by the user or chosen by a hypothetical optimizer. If a user selects a join type, Pig philosophy guides us to always respect the user's choice and not explore alternatives. Therefore, we need a way to pin options. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
[ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761070#action_12761070 ] Dmitriy V. Ryaboy commented on PIG-984: --- Good idea. It should be straightforward to look at the sort info associated with the ResourceSchema (see the load/store proposal) to know whether the data is sorted; this frees us from relying on loaders, lets us follow ORDER BYs and LIMITs, etc. Still, this is not quite safe unless you know that the distribution key is a subset of your group key. A simple sorted input stream can still be split among mappers with some rows with the same key going to one, and some to the other. Do you have thoughts on how to handle such cases? This is something that can be inferred looking at the schema and distribution key. I understand wanting a manual handle to turn on the behavior while developing, but the production version of this can be done automatically ( if distributed by and sorted on a subset of group keys, apply map-side group rule in the optimizer). PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data Key: PIG-984 URL: https://issues.apache.org/jira/browse/PIG-984 Project: Pig Issue Type: New Feature Reporter: Richard Ding The general group by operation in Pig needs both mappers and reducers (the aggregation is done in reducers). This incurs disk writes/reads between mappers and reducers. However, in the cases where the input data has the following properties 1. The records with the same key are grouped together (such as the data is sorted by the keys). 2. The records with the same key are in the same mapper input. the group by operation can be performed in the mappers only and thus remove the overhead of disk writes/reads. Alan proposed adding a hint to the group by clause like this one: {code} A = load 'input' using SomeLoader(...); B = group A by $0 using mapside; C = foreach B generate ... 
{code} The proposed addition of using mapside to group introduces a map-side group operator that collects all records for a given key into a buffer. When it sees a key change, it will emit the key and the bag of records it had buffered. It will assume that all records for a given key are collected together and thus there is no need to buffer across keys. It is expected that SomeLoader will be implemented by data systems such as Zebra to ensure the data emitted by the loader satisfies the above properties (1) and (2). It will be the responsibility of the user (or the loader) to guarantee properties (1) and (2) before invoking the mapside hint for the group by clause. The Pig runtime can't check for errors in the input data. For group by clauses with the mapside hint, Pig Latin will only support group by columns (including *), not group by expressions nor group all. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
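The buffering behavior described above -- collect records until the key changes, then emit the key with its bag -- can be sketched in plain Java. This is only an illustration of the algorithm; the real operator works on Pig tuples inside a mapper, and the class, method, and record layout here are assumptions made for the example. It presumes the input satisfies properties (1) and (2), so a single buffer suffices.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the map-side group buffering described above.
// Each record is a String[2] of {key, value}; equal keys are assumed
// adjacent (property 1) and never split across inputs (property 2).
public class MapSideGroup {
    public static Map<String, List<String>> group(List<String[]> records) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String currentKey = null;
        List<String> buffer = new ArrayList<>();
        for (String[] rec : records) {
            if (currentKey != null && !currentKey.equals(rec[0])) {
                groups.put(currentKey, buffer); // key changed: emit buffered bag
                buffer = new ArrayList<>();
            }
            currentKey = rec[0];
            buffer.add(rec[1]);
        }
        if (currentKey != null) {
            groups.put(currentKey, buffer);     // emit the final bag
        }
        return groups;
    }
}
```

Note that if the input violated property (1) -- equal keys not adjacent -- this logic would silently emit the same key twice, which is why the runtime cannot detect such errors cheaply.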
[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
[ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12757979#action_12757979 ] Dmitriy V. Ryaboy commented on PIG-966: --- The comments below are from both me and Ashutosh. We'd like to preface this with saying that we think overall, the proposed changes are very useful and important, and are likely to result in significantly reducing the barriers to Pig adoption in the broader Hadoop user community. There are a lot of suggestions and critiques below, but that's just because we care :-) On to the notes. *Names of interfaces* Can you explain why everything has a Load prefix? Seems like this limits the interfaces unnecessarily, and is a bit inconsistent semantically (LoadMetadata does not represent metadata associated with loading -- it loads metadata. LoadStatistics does not load statistics; it represents statistics, and is loaded using LoadMetadata). How about: LoadCaster -> PigTypeCaster, LoadPushDown -> Filterable, Projectionable (the latter may need a better name) (clearly, we are also suggesting breaking down the interface into multiple interfaces -- more on that later), LoadSchema -> ResourceSchema, LoadFieldSchema -> FieldSchema or ResourceFieldSchema, LoadMetadata -> MetadataReader, StoreMetadata -> MetadataWriter, LoadStatistics -> ResourceStatistics. *LoadFunc* In regards to the appropriate parameters for setURI -- can you explain the advantage of this over Strings in more detail? I think the current setLocation approach is preferable; it gives users more flexibility. Plus Hadoop Paths are constructed from strings, not URIs, so we are forcing a string-uri-string conversion on the common case. The _getLoadCaster_ method -- perhaps _getTypeCaster_ or _getPigTypeCaster_ is a better name? _prepareToRead_: does it need a _finishReading()_ mate? I would like to see a standard method for getting the jobconf (or whatever it is called in 20/21), both for LoadFunc and StoreFunc.
*LoadCaster (Or PigTypeCaster..)* This interface is implemented by UTF8StorageConverter. Let's decide on what these are -- _casters_ or _converters_ -- and use one term. *LoadMetadata (or MetadataLoader)* Some thoughts on the problem of what happens when the loader is loading multiple resources or a resource with multiple partitions. We think that the schema should be uniform for everything a single instance of a loader is responsible for loading (and the loader can fill in null or defaults where appropriate if some resources are missing fields). Statistics should be aggregated, since the collection of resources will be treated as one (knowledge of relevant partitions would be used by a Filterable/Projectionable/Pushdownable loader to push selections down, not, I think, by downstream operators). So we have two options. In option 1, getStatistics would return a collection (lower c) of stats associated with the resources that the loader is loading, perhaps as a Map of String -> ResourceStatistics. These would need to go through a stat aggregator of some sort that would know how to deal with unifying statistics across multiple resources in a generic way. In option 2, getStatistics would be responsible for its own implementation of aggregation, which would give it flexibility in terms of how such aggregation is done. Since we don't expect many stat stores, this seems preferable to us, as generic aggregation is going to be hard to get right. (Of course there is option 3, where we have a default stat aggregator class that can be extended/overridden by individual MetadataLoaders, but I imagine this would be a hard sell.) *LoadSchema (or ResourceSchema)* Should org.apache.pig.impl.logicalLayer.schema.Schema be changed to use this as an internal representation? I like how sort information is handled here. Perhaps we can consider using this approach instead of _SortInfo_ in PIG-953. If _PigSchema_ implements or contains _ResourceSchema_, _SortInfo_ will no longer be needed.
_PartitionKeys_ aren't really part of schema; they are a storage/distribution property. This should go into the Metadata and refer to the schema. *LoadStatistics (or ResourceStatistics)* Why the public fields? Not that I am a huge fan of getters and setters but I sense findbugs warnings heading our way. We need to account for some statistics being missing. What should numRecords be set to if we don't know the number of records? We can use Long and set it to null; we can use a magic value (-1?); we can wrap in a getter and throw an exception (ugh). I had envisioned statistics as more of a key-value thing, with some keys predefined in a separate class. So we would have:
{code}
ResourceStats.NUM_RECORDS
ResourceStats.SIZE_IN_BYTES
// etc
{code}
and to get the stats we would call
{code}
MyResourceStats.getLong(ResourceStats.NUM_RECORDS)
{code}
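The key-value statistics idea sketched in the comment above could look something like the following. This is a hypothetical sketch, not a proposed Pig API: the key names and class shape are invented for illustration, and a null return stands in for a missing statistic (avoiding both magic values and exceptions).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the key/value statistics idea from the
// comment above; class and key names are illustrative, not a real
// Pig API. Missing statistics come back as null rather than -1.
public class ResourceStats {
    public static final String NUM_RECORDS = "numRecords";
    public static final String SIZE_IN_BYTES = "sizeInBytes";

    private final Map<String, Long> values = new HashMap<>();

    public void setLong(String key, long value) {
        values.put(key, value);
    }

    public Long getLong(String key) {
        return values.get(key); // null when the statistic was not collected
    }
}
```

New statistics can then be added without touching the base class, which is the extensibility concern raised later in the thread.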
[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
[ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758037#action_12758037 ] Dmitriy V. Ryaboy commented on PIG-966: --- Hi Alan, Responses to responses: bq. Perhaps it's best to leave this as strings but look for a scheme at the beginning and interpret it as a URI if it has one (which is what Pig does now). I understand the motivation more clearly now, thanks for the explanation. Agreed with the quoted approach. bq. [regarding single schema for partitioned datasets] Agreed, that is what I was trying to say. Perhaps it wasn't clear. Nope, it was clear, I just have a very verbose way of saying yes. Regarding merging the Schemas you said: bq. No. It serves a different purpose, which is to define the content of data flows inside the logical plan. We should not tie these two together. I don't really understand the difference, but I'll defer to your superior knowledge of the codebase and accept your decision :-). bq. I'm not inclined to bend my programming style to match that of whoever wrote findbugs. +9.3 from the Russian judge. Gleefully accepted. bq. We need partition keys as part of this interface, as Pig will need to be able to pass partition keys to loaders that are capable of doing partition pruning. So we could add getPartitionKeys to the LoadMetadata interface. That's precisely what I am suggesting -- take it out of Schema, put it in LoadMetadata (or MetadataReader, as I like to call it). bq. The problem with key/value set ups like this is it can be hard for people to understand what is already there. So they end up not using what already exists, or worse, re-inventing the wheel. My hope is that by versioning this we can get around the need for this key/value stuff. Hm, I see your point. I am interested in being able to augment the set of available statistics without requiring changes to the base classes, however. I guess that's where inheritance comes in handy.
Any comments on how to handle missing data? Primitive types still don't work for that. bq. So what happens tomorrow when some loaders can do merge joins on sorted data? Now we have to have another interface. I want this to be easily extensible. I must not be clear on what pushing down to a loader does. My interpretation was that it allows pushing down operations to the point where you don't read unnecessary data off disk. A classic example of filter pushdown would be filtering by a partition key (so, dt > sysdate-30, and our data is stored in files, one per day). An example of projection pushdown is when we have a column store that simply avoids loading some of the columns. I don't see how a loader can push down a join. That seems to require reading and changing data. Is the idea that such a join can be performed without an MR step? That seems like a Pig thing, not a loader thing. In any case, yes, I think something like this would require a new interface in the same namespace, since it's a drastically different capability. Any thoughts on the advisability of simplifying projection pushdown to just work on an int array? I know it's limiting, but it's going to be a heck of a lot easier for users to implement. bq. I'm assuming that a given StoreFunc is tied to a particular metadata instance, so it would return its implementation of StoreMetadata. I was assuming that Pig would have a preferred metadata store (such as Owl), and it would attempt to use it unless instructed otherwise. We could even try some cascading thing: if the user specifies a metadata store on the command line, use that; if not, see whether the loader suggests one; if not, use Owl; if Owl doesn't have anything, see if it's a file in a known scheme (hdfs, file, s3n...) and at least get some file-level metadata such as create date and size. StoreMetadata can do the same (except for the hdfs part). I'll take another look at PIG-967.
Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces --- Key: PIG-966 URL: https://issues.apache.org/jira/browse/PIG-966 Project: Pig Issue Type: Improvement Components: impl Reporter: Alan Gates Assignee: Alan Gates I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for full details -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs
[ https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12757310#action_12757310 ] Dmitriy V. Ryaboy commented on PIG-948: --- I don't see a problem with url construction in Pig code. If Hadoop exposed this, then sure, it would be better to use such a feature. Since Hadoop does not expose it (afaik), it's more useful for the end-user to have this url than to have a jobid. Maintenance on this piece of code is minimal -- after all, it's just a simple string concatenation we are talking about. If Hadoop changes how this url is constructed, it will take about 3 minutes to fix, 2.5 of which will be spent opening a Jira ticket. In the meantime, users will have a more usable product than they would without this one line of code. [Usability] Relating pig script with MR jobs Key: PIG-948 URL: https://issues.apache.org/jira/browse/PIG-948 Project: Pig Issue Type: Improvement Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Priority: Minor Attachments: pig-948.patch Currently it's hard to relate a pig script with a specific MR job. In a loaded cluster with multiple simultaneous job submissions, it's not easy to figure out which specific MR jobs were launched for a given pig script. If Pig can provide this info, it will be useful to debug and monitor the jobs resulting from a pig script. At the very least, Pig should be able to provide the user the following information: 1) Job id of the launched job. 2) Complete web url of the jobtracker running this job. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
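The "one line of code" under discussion amounts to a concatenation along these lines. The `jobdetails.jsp` path matches the JobTracker web UI of this era, but treat the exact pattern -- and the class and method names -- as assumptions made for illustration, not a stable Hadoop API.

```java
// Sketch of the job-details URL concatenation discussed above. The
// jobdetails.jsp path is an assumption based on the JobTracker web UI
// of this era; class and method names are invented for the example.
public class JobTrackerUrl {
    public static String jobDetailsUrl(String jtHttpAddress, String jobId) {
        return "http://" + jtHttpAddress + "/jobdetails.jsp?jobid=" + jobId;
    }
}
```

If Hadoop later changes the URL layout, only this one method needs to change, which is the maintenance argument made in the comment.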
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753801#action_12753801 ] Dmitriy V. Ryaboy commented on PIG-953: --- Pradeep, First, I think this is very important to have, not just for Merge but for other things that might benefit from knowing sort orders as well. A few minor nits from a cursory glance at the code. I didn't check the actual logic very carefully yet -- it looks like the large diff blocks in MergeSort et al are mostly moves of code blocks, not significant code changes, correct? On to the comments: seekNear seems ambiguous, as "near" is a generic concept that does not necessarily imply "before or to, but not after" -- which is what this method is required to do. How about seekBefore()? Why do getAscColumns and getSortColumns make a copy of the list? Seems like we can save some memory and cpu here. For that matter, why not use a map of (String)colName -> (Boolean)ascending instead of 2 lists? One structure, plus O(1) lookup. Not sure about the use of super() in the constructor of a class that doesn't extend anything but Object. Is there some magic that requires it? In Log2PhysTranslator, why hardcode the Limit operator? There are other operators that don't change sort order, such as filter. Perhaps add a method to Logical Operators that indicates if they alter sort order of their inputs? In Utils, checkNullEquals is better written as
{code}
if (obj1 == null || obj2 == null) {
    return obj1 == obj2;
} else {
    return checkEquality ? obj1.equals(obj2) : true;
}
{code}
Even with this rewrite, this seems like an odd function. It being as odd as it is leads to it not being used safely when you set checkEquality to false (just a few lines later) -- if obj1 is null and obj2 is not, the func returns true, you try to call a method on obj1, and get an NPE.
Probably better not to roll all this into one amorphous function and simply write
{code}
Util.bothNull(obj1, obj2) || (Util.notNull(obj1, obj2) && obj1.equals(obj2));
{code}
(the implementations of bothNull and notNull are obvious -- just the conjunction and disjunction of the obj == null checks) In StoreConfig, this comment has a typo (and instead of an): * 1) the store does not follow and order by Enable merge join in pig to work with loaders and store functions which can internally index sorted data - Key: PIG-953 URL: https://issues.apache.org/jira/browse/PIG-953 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Attachments: PIG-953.patch Currently the merge join implementation in pig includes construction of an index on sorted data and use of that index to seek into the right input to efficiently perform the join operation. Some loaders (notably the zebra loader) internally implement an index on sorted data and can perform this seek efficiently using their index. So the use of the index needs to be abstracted in such a way that when the loader supports indexing, pig uses it (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
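Spelled out, the two helpers suggested in the comment above, plus the null-safe equality check they compose into, look like this. The class name is invented for the sketch (the comment just says "Util"); only the logic is taken from the thread.

```java
// Sketch of the bothNull/notNull helpers suggested above, plus the
// null-safe equality check they compose into. Class name is invented.
public class NullUtil {
    public static boolean bothNull(Object a, Object b) {
        return a == null && b == null;
    }

    public static boolean notNull(Object a, Object b) {
        return a != null && b != null;
    }

    // True iff both are null, or both are non-null and equal; never NPEs.
    public static boolean nullSafeEquals(Object a, Object b) {
        return bothNull(a, b) || (notNull(a, b) && a.equals(b));
    }
}
```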
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753833#action_12753833 ] Dmitriy V. Ryaboy commented on PIG-953: --- I got my trues and falses reversed on the NPE thing. You are right, the function works as intended. I still think it's too verbose, but agree that it's a style issue -- I guess if the committers like it, it's fine :-) Enable merge join in pig to work with loaders and store functions which can internally index sorted data - Key: PIG-953 URL: https://issues.apache.org/jira/browse/PIG-953 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Attachments: PIG-953.patch Currently the merge join implementation in pig includes construction of an index on sorted data and use of that index to seek into the right input to efficiently perform the join operation. Some loaders (notably the zebra loader) internally implement an index on sorted data and can perform this seek efficiently using their index. So the use of the index needs to be abstracted in such a way that when the loader supports indexing, pig uses it (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-943) Pig crash when it cannot get counter from hadoop
[ https://issues.apache.org/jira/browse/PIG-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12751631#action_12751631 ] Dmitriy V. Ryaboy commented on PIG-943: --- Hi Daniel, My apologies, I worded my comment poorly. I wasn't minus-oneing the patch, I was saying that the use of -1 as a magic value is a bit hacky. I think inserting Long.NaN or null and checking for it on the other end, instead of checking for -1, is cleaner. Pig crash when it cannot get counter from hadoop Key: PIG-943 URL: https://issues.apache.org/jira/browse/PIG-943 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.4.0 Attachments: PIG-943-1.patch We see the following call stacks in Pig: Case 1:
Caused by: java.lang.NullPointerException
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.computeWarningAggregate(MapReduceLauncher.java:390)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:238)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
Case 2:
Caused by: java.lang.NullPointerException
at org.apache.pig.tools.pigstats.PigStats.accumulateMRStats(PigStats.java:150)
at org.apache.pig.tools.pigstats.PigStats.accumulateStats(PigStats.java:91)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:192)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
In both cases, hadoop jobs finish without error. The cause of both problems is that RunningJob.getCounters() returns null, and Pig does not currently check for that. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
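The null check being discussed reduces to a simple guard. A hedged sketch, where a plain Map stands in for Hadoop's Counters object (which RunningJob.getCounters() may return as null): the point is only that "counters unavailable" should propagate as null rather than trigger an NPE or be conflated with a real value like -1.

```java
import java.util.Map;

// Sketch of the guard discussed above. The Map stands in for Hadoop's
// Counters object -- this is not Hadoop API, just the null-propagation
// pattern: a null result from getCounters() becomes "unknown" (null)
// instead of an NPE or a magic value.
public class CounterGuard {
    public static Long safeCounterValue(Map<String, Long> counters, String name) {
        if (counters == null) {
            return null;           // counters unavailable: report "unknown"
        }
        return counters.get(name); // may still be null if the counter is absent
    }
}
```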
[jira] Commented: (PIG-936) making dump and PigDump independent from Tuple.toString
[ https://issues.apache.org/jira/browse/PIG-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12749255#action_12749255 ] Dmitriy V. Ryaboy commented on PIG-936: --- Patch makes sense. pig.data doesn't seem like the right package for this class -- perhaps pig.tools ? Also please make sure to format your code in accordance with the style guidelines (http://java.sun.com/docs/codeconv/ ), and use 4 spaces -- not tabs -- for indentation. making dump and PigDump independent from Tuple.toString --- Key: PIG-936 URL: https://issues.apache.org/jira/browse/PIG-936 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Olga Natkovich Assignee: Jeff Zhang Fix For: 0.4.0 Since Tuple is an interface, a toString implementation can change from one tuple implementation to the next. This means that format of dump and PigDump will be different depending on the tuples processed. This could be quite confusing to the users. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
[ https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12749196#action_12749196 ] Dmitriy V. Ryaboy commented on PIG-934: --- Throwing an exception when a seek is past the file boundary seems acceptable to me (and preferable to adding new functions and changing upstream code that shouldn't care about this detail). Especially since if there is a way to get a consistent ordering among files in a directory, it's trivial to later update this code to seek past file boundaries and into the next file. Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index -- Key: PIG-934 URL: https://issues.apache.org/jira/browse/PIG-934 Project: Pig Issue Type: Bug Affects Versions: 0.3.1 Reporter: Pradeep Kamath Assignee: Ashutosh Chauhan Attachments: pig-934.patch We use POLoad to seek into the right file, which has the following code:
{noformat}
public void setUp() throws IOException {
    String filename = lFile.getFileName();
    loader = (LoadFunc)PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
    is = FileLocalizer.open(filename, pc);
    loader.bindTo(filename, new BufferedPositionedInputStream(is),
                  this.offset, Long.MAX_VALUE);
}
{noformat}
Between opening the stream and bindTo we do not seek to the right offset. bindTo itself does not perform any seek. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745518#action_12745518 ] Dmitriy V. Ryaboy commented on PIG-924: --- Owen -- I may not have made the intent clear; the idea is that when Pig is rewritten to use the future-proofed APIs, the shims will go away (presumably for 0.5). Right now, Pig is not using the new APIs; even the 20 patch posted by Olga uses the deprecated mapred calls. This is only to make life easier in the transitional period while Pig is using the old, mutating APIs. Check out the pig user list archives for motivation of why these shims are needed. Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: pig_924.2.patch, pig_924.3.patch, pig_924.patch The current Pig build scripts package hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745109#action_12745109 ] Dmitriy V. Ryaboy commented on PIG-924: --- Regarding deprecation -- I tried setting it back to off, and adding @SuppressWarnings("deprecation") to the shims for 20, but ant complained about deprecation nonetheless. Not sure what its deal is. Adding something like this to the main build.xml works. Does this seem like a reasonable solution?
{code}
<!-- set deprecation off if hadoop version greater or equals 20 -->
<target name="set_deprecation">
    <condition property="hadoop_is20">
        <equals arg1="${hadoop.version}" arg2="20"/>
    </condition>
    <antcall target="if_hadoop_is20"/>
    <antcall target="if_hadoop_not20"/>
</target>
<target name="if_hadoop_is20" if="hadoop_is20">
    <property name="javac.deprecation" value="off"/>
</target>
<target name="if_hadoop_not20" unless="hadoop_is20">
    <property name="javac.deprecation" value="on"/>
</target>
<target name="init" depends="set_deprecation">
[]
{code}
Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: pig_924.2.patch, pig_924.3.patch, pig_924.patch The current Pig build scripts package hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-924: -- Status: Patch Available (was: Open) Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: pig_924.2.patch, pig_924.patch The current Pig build scripts package hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-924: -- Attachment: pig_924.2.patch This patch addresses the reviewer comments. I put the factor of 0.9 into the 18 shim to restore the old behavior (not sure what the motivation was for changing this for 20). I set the default hadoop version to 18, so that we can verify correctness by running the automated tests. The existing unit tests are sufficient verification of this patch (at least as far as 18 is concerned). Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: pig_924.2.patch, pig_924.patch The current Pig build scripts package hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-924: -- Status: Patch Available (was: Open) Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: pig_924.2.patch, pig_924.3.patch, pig_924.patch The current Pig build scripts package Hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-923) Allow setting logfile location in pig.properties
[ https://issues.apache.org/jira/browse/PIG-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-923: -- Status: Patch Available (was: Open) Allow setting logfile location in pig.properties Key: PIG-923 URL: https://issues.apache.org/jira/browse/PIG-923 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Dmitriy V. Ryaboy Fix For: 0.4.0 Attachments: pig_923.patch Local log file location can be specified through the -l flag, but it cannot be set in pig.properties. This JIRA proposes a change to Main.java that allows it to read the pig.logfile property from the configuration. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744307#action_12744307 ] Dmitriy V. Ryaboy commented on PIG-924: --- Thanks for looking, Todd -- most of those changes, like the factor of 0.9, deprecation, excluding the HBase test, etc., are consistent with the 0.20 patch posted to PIG-660. Moving junit.hadoop.conf is critical -- there are comments about this in 660 -- without it, resetting hadoop.version doesn't actually work, as some information from a previous build sticks around. I'll fix the whitespace; this wasn't a final patch, more of a proof of concept. The point is that this approach could work, but currently can't, because Hadoop is bundled in the jar. I am looking for comments from the core developer team regarding the possibility of un-bundling. Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: pig_924.patch The current Pig build scripts package Hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-911: -- Status: Open (was: Patch Available) [Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Attachments: pig_911.2.patch, pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-911: -- Attachment: pig_911.2.patch Addressed Alan's comments. [Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Attachments: pig_911.2.patch, pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744343#action_12744343 ] Dmitriy V. Ryaboy commented on PIG-911: --- Concerning making this a StoreFunc as well -- the StoreFunc interface is not very friendly to this. All you get in the bind call is the output stream; for LoadFunc, you also get the name of the file (or, presumably, whatever it was the user passed in under the guise of a file name). This means that for the LoadFunc, I was able to use the passed-in filename to back into a Path and a FileSystem. I can't do the same for StoreFunc, where the filename is not available -- only the output stream is. That means I can't create the appropriate SequenceFile.Writer. Is there a way around this limitation that does not involve requiring special constructor parameters to be used? Is it possible to change the StoreFunc API to provide this information, or to make it available through some side channel (MapRedUtils or similar)? [Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Attachments: pig_911.2.patch, pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-845) PERFORMANCE: Merge Join
[ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742562#action_12742562 ] Dmitriy V. Ryaboy commented on PIG-845: --- Alan, Ashutosh -- maybe I am misunderstanding where null keys come from in the Indexer. I assumed this was due to the processing that happens in the plan the indexer deserializes and attaches to its POLocalRearrange. In regards to errors, I was referring to this: {code} catch(PlanException e){ int errCode = 2034; String msg = "Error compiling operator " + joinOp.getClass().getCanonicalName(); throw new MRCompilerException(msg, errCode, PigException.BUG, e); } {code} The only central place for error codes seems to be the Wiki. A class with a bunch of static final error codes would be a better place. Ashutosh, I completely disagree with you on changing all tests to run in MR mode. The tests are already impossible to run on a laptop (people, myself included, actually submit patches to JIRA just to see if tests pass). Running in MR mode will incur significant overhead per test. Only things that actually rely on the MR bits should be tested in MR mode (and use mock objects if possible; there's been some advancement on that front in Hadoop 20, though I haven't looked at it yet). I would love to see a more efficient indexing MR job (which would reduce load on the JobTracker, keep schedulers less busy, and incur less task-startup overhead by requiring fewer tasks), but perhaps not before 0.4 is out the door with existing functionality. Just to be clear, I don't think more than one record per block is necessary, but more than one block per task would probably be a good thing. Any thoughts on how to choose which of the two relations to index? We get locality on the non-indexed relation, but not on the indexed one, which probably throws a kink in the normal way of thinking about this.
PERFORMANCE: Merge Join --- Key: PIG-845 URL: https://issues.apache.org/jira/browse/PIG-845 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Ashutosh Chauhan Attachments: merge-join.patch This join would work if the data for both tables is sorted on the join key. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
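The comment above suggests replacing bare numeric error codes (like the 2034 in the snippet) with a central class of static final constants. A minimal sketch of what that could look like; PigError and PigErrors are hypothetical names taken from the discussion, not existing Pig classes:

```java
// Sketch of the suggestion: centralize error codes as lightweight static
// final constants instead of declaring them deep inside operator classes.
// PigError/PigErrors are hypothetical names, not actual Pig code.
final class PigError {
    final int code;
    final String name;
    final String description;

    PigError(int code, String name, String description) {
        this.code = code;
        this.name = name;
        this.description = description;
    }
}

final class PigErrors {
    static final PigError ERROR_COMPILING_OPERATOR = new PigError(
            2034, "ERROR_COMPILING_OPERATOR", "Error compiling operator");

    private PigErrors() {} // constants holder, never instantiated
}
```

A compiler could then throw using PigErrors.ERROR_COMPILING_OPERATOR.code instead of a bare 2034, and the wiki's error-code table could be generated from this class rather than maintained by hand.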
[jira] Commented: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742565#action_12742565 ] Dmitriy V. Ryaboy commented on PIG-911: --- Alan, Thanks for the feedback. I'll add the try/catch. In regards to the UTF8StorageConverter -- I think I added that because, before that, the code broke if you didn't declare a typed schema at load time (so, a = load 'foo' using SequenceFileLoader() as (a,b) instead of a = load 'foo' using SequenceFileLoader() as (a:chararray, b:double)). I'll figure out what exactly is going on with that and remove the UTF8StorageConverter. Will add Store as time allows. [Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Attachments: pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742083#action_12742083 ] Dmitriy V. Ryaboy commented on PIG-833: --- Alan -- if it's not finding .dfs , it's probably not linking hadoop20.jar Try my patch in 660 :-) Storage access layer Key: PIG-833 URL: https://issues.apache.org/jira/browse/PIG-833 Project: Pig Issue Type: New Feature Reporter: Jay Tang Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code. This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata. Eventually it could also support predicate pushdown for further performance improvement. Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742170#action_12742170 ] Dmitriy V. Ryaboy commented on PIG-833: --- Alan, this means Pig contrib/ is no longer compatible with Hadoop 18, which probably means you need to either roll this back or roll 660 in (and add the hadoop20.jar file to lib/). Otherwise the build is broken. Storage access layer Key: PIG-833 URL: https://issues.apache.org/jira/browse/PIG-833 Project: Pig Issue Type: New Feature Reporter: Jay Tang Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code. This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata. Eventually it could also support predicate pushdown for further performance improvement. Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-845) PERFORMANCE: Merge Join
[ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12741589#action_12741589 ] Dmitriy V. Ryaboy commented on PIG-845: --- Some comments below. It's a big patch, so a lot of comments... 1. EndOfAllInput flag -- could you add comments here about what the point of this flag is? You explain what EndOfAllInputSetter does (which is actually rather self-explanatory) but not what the flag means and how it's used. There is a bit of an explanation in PigMapBase, but it really belongs here. 2. Could you explain the relationship between EndOfAllInput and the (deleted) POStream? 3. Comments in MRCompiler alternate between referring to the left MROp as LeftMROper and curMROper. Choose one. 4. I am curious about the decision to throw compiler exceptions if MergeJoin requirements regarding the number of inputs, etc., aren't satisfied. It seems like a better user experience would be to log a warning and fall back to a regular join. 5. Style notes for visitMergeJoin: it's a 200-line method. Any way you can break it up into smaller components? As is, it's hard to follow. The if statements should be broken up into multiple lines to agree with the style guides. Variable naming: you've got topPrj, prj, pkg, lr, ce, nig... one at a time they are fine, but together in a 200-line method they are unreadable. Please consider more descriptive names. 6. Kind of a global comment, since it applies to more than just MergeJoin: it seems to me like we need a Builder for operators to clean up some of the new, set, set, set stuff.
Having the setters return this, and a Plan's add() method return the plan, would let us replace this: {code} POProject topPrj = new POProject(new OperatorKey(scope, nig.getNextNodeId(scope))); topPrj.setColumn(1); topPrj.setResultType(DataType.TUPLE); topPrj.setOverloaded(true); rightMROpr.reducePlan.add(topPrj); rightMROpr.reducePlan.connect(pkg, topPrj); {code} with this: {code} POProject topPrj = new POProject(new OperatorKey(scope, nig.getNextNodeId(scope))) .setColumn(1).setResultType(DataType.TUPLE) .setOverloaded(true); rightMROpr.reducePlan.add(topPrj).connect(pkg, topPrj); {code} 7. Is the change to List<List<Byte>> keyTypes in POFRJoin related to MergeJoin or just rolled in? 8. MergeJoin: break getNext() into components. I don't see you supporting left outer joins. Plans for that? At least document the planned approach. Error codes being declared deep inside classes, and documented on the wiki, is a poor practice, imo. They should be pulled out into PigErrors (as lightweight final objects that have an error code, a name, and a description). I thought Santhosh made progress on this already, no? Could you explain the problem with splits and streams? Why can't this work for them? 9. Sampler/Indexer: 9a. Looks like you create the same number of map tasks for this as you do for a join; all a sampling map task does is read one record and emit a single tuple. That seems wasteful; there is a lot of overhead in setting up these tiny jobs, which might get stuck behind other jobs running on the cluster, etc. If the underlying file has sync points, a smaller number of MR tasks can be created. If we know the ratio of sample tasks to full tasks, we can figure out how many records we should emit per job ( ceil(full_tasks/sample_tasks) ). We can approximately achieve this by seeking through (end-offset)/num_to_emit and doing a sync() after that seek. It's approximate, but close enough for an index.
9b. Consider renaming to something like SortedFileIndexer, since it's conceivable that this component can be reused in a context other than a Merge Join. 10. Would it make sense to expose this to users via a 'CREATE INDEX' (or similar) command? That way the index could be persisted, and the user could tell you to use an existing index instead of rescanning the data. 11. I am not sure about the approach of pushing sampling above filters. Have you guys benchmarked this? Seems like you'd wind up reading the whole file in the sample job if the filter is selective enough (and high filter selectivity would also make materialize-then-sample go much faster). Testing: 12a. You should test for refusal to do a 3-way join and other error conditions (or a warning and successful failover to a regular join -- my preference). 12b. You should do a proper unit test for the MergeJoinIndexer (or whatever we are calling it). PERFORMANCE: Merge Join --- Key: PIG-845 URL: https://issues.apache.org/jira/browse/PIG-845 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Ashutosh Chauhan Attachments: merge-join-1.patch, merge-join-for-review.patch This join would work if the data for both tables is sorted on the join key. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-561) Need to generate empty tuples and bags as a part of Pig Syntax
[ https://issues.apache.org/jira/browse/PIG-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12741133#action_12741133 ] Dmitriy V. Ryaboy commented on PIG-561: --- I believe PIG-773 fixes this. Can we close this? Need to generate empty tuples and bags as a part of Pig Syntax -- Key: PIG-561 URL: https://issues.apache.org/jira/browse/PIG-561 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Viraj Bhat There is a need to sometimes generate empty tuples and bags as a part of the Pig syntax rather than using UDF's {code} a = load 'mydata.txt' using PigStorage(); b =foreach a generate ( ) as emptytuple; c = foreach a generate { } as emptybag; dump c; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740241#action_12740241 ] Dmitriy V. Ryaboy commented on PIG-660: --- The shim patch posted above doesn't work as cleanly as desired; the current build.xml has junit.hadoop.conf pointing to a directory in ${user.home}. This has an undesired effect -- a hadoop config file gets created the first time you run ant, which among other things sets which class implements the FileSystem interface. When ant gets re-run with a different hadoop version, 'ant clean' does not clean out this file -- so an incorrect fs class name gets used. Deleting the directory created by junit.hadoop.conf before rerunning fixes the problem; so does making the value of junit.hadoop.conf relative to ${build.dir} instead of ${user.home}. As I am not sure how the Y! developers use the pigconf directories this thing references, I do not know the appropriate way to proceed. Comments? Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch, pig_660_shims.patch, pig_660_shims_2.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. The JobControl should handle all exceptions and report them appropriately.
For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-660: -- Attachment: pig_660_shims_3.patch The attached patch fixes the mentioned issue with junit.hadoop.conf by setting it to $build.dir/conf This can be overridden by build.properties if individual contributors want to revert to the old behavior. Also added a compatibility shim for hadoop19 (from PIG-573) Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch, pig_660_shims.patch, pig_660_shims_2.patch, pig_660_shims_3.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740339#action_12740339 ] Dmitriy V. Ryaboy commented on PIG-660: --- Nate, Your stacktrace shows hadoop.dfs calls (as opposed to hdfs), which tells me it's looking for -- and finding -- hadoop 18 classes. Can you do this: export PIG_HADOOP_VERSION=20; ant clean; ant -Dhadoop.version=20 and try again? Just to be sure, try moving hadoop1* out of the lib directory (so that it fails for sure if it's trying to look for 18). Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch, pig_660_shims.patch, pig_660_shims_2.patch, pig_660_shims_3.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-893) support cast of chararray to other simple types
[ https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739568#action_12739568 ] Dmitriy V. Ryaboy commented on PIG-893: --- Jeff, Thanks for the contribution! Just a few comments: 0) could you name your patch files *.patch? That makes them easier to review, as the proper highlighting mode is chosen. 1) Other class names in the utils package imply that the class name for this should be CastUtils 2) Spacing in POCast.java is a bit messed up. Please make sure all spacing is to project conventions 3) In TestSchema -- Numberic isn't a word, you mean Numeric (no b) 4) I am not sure about naming the methods chararrayTo. Since they take String as an argument, being in Java-land, I think it would be more straightforward to say stringToXXX. 5) Implementation of the casts -- you call str.toBytes(), and hand off to a bytesToXXX method. That method, in turn, converts the bytes back into a string, and proceeds to do the conversion. That seems like redundant work. Wouldn't it be better to have stringToXXX perform the conversion, and have bytesToXXX convert to a string, then call the stringToXXX method? 6) TestCharArray2Numeric.java -- the convention is to spell out To instead of using the number 2 7) The tests in TestCharArray2Numeric look very similar to each other. Could you pull out the common functionality so the code is not repeated? About the tests themselves: since you are just testing conversions, this can be a straightforward unit test -- make a few strings, assert that they convert to the expected value. Hit the edge cases (overflows, special cases for parsing, etc). We don't need to spin up a whole Pig query. 8) I don't like testing random values, as this creates tests that might sometimes pass and sometimes not. Recommend using known data for reproducible test results.
9) You extracted functionality from Utf8StorageConverter by duplicating the code; I would prefer to see Utf8StorageConverter modified to hand off conversions to CastUtils. support cast of chararray to other simple types --- Key: PIG-893 URL: https://issues.apache.org/jira/browse/PIG-893 Project: Pig Issue Type: New Feature Affects Versions: 0.4.0 Reporter: Thejas M Nair Assignee: Jeff Zhang Fix For: 0.4.0 Attachments: Pig_893_Patch.txt Pig should support casting of chararray to integer, long, float, double, bytearray. If the conversion fails for reasons such as overflow, the cast should return null and log a warning. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-660: -- Attachment: pig_660_shims_2.patch Sure is.. uploading a patch with the fixed package name. Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch, pig_660_shims.patch, pig_660_shims_2.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739570#action_12739570 ] Dmitriy V. Ryaboy commented on PIG-909: --- Sorry I am being slow -- which libraries are missing from the classpath you posted? Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739643#action_12739643 ] Dmitriy V. Ryaboy commented on PIG-909: --- Oh I see. I have this in my bashrc: export PIG_CLASSPATH=$PIGDIR/pig.jar I thought this was included in a README somewhere. I guess we can modify bin/pig to use this as a default value (so a user can still override by setting PIG_CLASSPATH to something else). Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-911: -- Attachment: pig_sequencefile.patch The attached patch is an initial implementation of a loader for SequenceFiles. It works with keys and values of the following types: Text, IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, ByteWritable I would appreciate some comments on how to properly handle errors (casting errors, IO errors, etc). [Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Attachments: pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-908) Need a way to correlate MR jobs with Pig statements
Need a way to correlate MR jobs with Pig statements --- Key: PIG-908 URL: https://issues.apache.org/jira/browse/PIG-908 Project: Pig Issue Type: Wish Reporter: Dmitriy V. Ryaboy Complex Pig Scripts often generate many Map-Reduce jobs, especially with the recent introduction of multi-store capabilities. For example, the first script in the Pig tutorial produces 5 MR jobs. There is currently very little support for debugging resulting jobs; if one of the MR jobs fails, it is hard to figure out which part of the script it was responsible for. Explain plans help, but even with the explain plan, a fair amount of effort (and sometimes, experimentation) is required to correlate the failing MR job with the corresponding PigLatin statements. This ticket is created to discuss approaches to alleviating this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-908) Need a way to correlate MR jobs with Pig statements
[ https://issues.apache.org/jira/browse/PIG-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739125#action_12739125 ] Dmitriy V. Ryaboy commented on PIG-908: --- An idea that might work (I haven't evaluated the complexity of implementing this): When LogicalOperators are created, a bit of metadata is attached to them, listing the line number that they come from. Multiple LOs may be created from a single line, and multiple lines may be associated with a single operator. This metadata is passed down to the Physical Operators. When an MR job is created, a log message is written listing the job name and the line numbers associated with the POs in that map-reduce job. Thoughts? Need a way to correlate MR jobs with Pig statements --- Key: PIG-908 URL: https://issues.apache.org/jira/browse/PIG-908 Project: Pig Issue Type: Wish Reporter: Dmitriy V. Ryaboy Complex Pig Scripts often generate many Map-Reduce jobs, especially with the recent introduction of multi-store capabilities. For example, the first script in the Pig tutorial produces 5 MR jobs. There is currently very little support for debugging resulting jobs; if one of the MR jobs fails, it is hard to figure out which part of the script it was responsible for. Explain plans help, but even with the explain plan, a fair amount of effort (and sometimes, experimentation) is required to correlate the failing MR job with the corresponding PigLatin statements. This ticket is created to discuss approaches to alleviating this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-909: -- Attachment: pig_909.patch The attached patch modifies bin/pig as described. Tested locally by setting and unsetting HADOOP_HOME and making sure the right configurations, etc, are picked up. Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
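The behavior described in the issue can be sketched as a small shell function. This is not the actual patch; the `hadoop-*.jar` glob and `conf/` directory layout are assumptions about a typical Hadoop install:

```shell
#!/bin/sh
# Sketch of the described bin/pig logic (not the actual patch): use the
# jars from an external Hadoop install when HADOOP_HOME names a valid
# directory, otherwise fall back to the hadoop jar bundled with Pig.
pig_hadoop_classpath() {
    hadoop_home=$1
    pig_home=$2
    version=$3
    if [ -n "$hadoop_home" ] && [ -d "$hadoop_home" ]; then
        cp=""
        # pick up the external hadoop jars and its conf directory
        for jar in "$hadoop_home"/hadoop-*.jar; do
            cp=$cp:$jar
        done
        echo "$cp:$hadoop_home/conf"
    else
        # no (or invalid) HADOOP_HOME: the bundled jar, as bin/pig does today
        echo ":$pig_home/hadoop${version}.jar"
    fi
}

pig_hadoop_classpath "" /opt/pig 18   # prints :/opt/pig/hadoop18.jar
```

Wrapping the whole lookup in the `if` is what keeps the external Hadoop optional: nothing changes for users who never set HADOOP_HOME.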
[jira] Updated: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-909: -- Attachment: pig_909.2.patch added ivy jars to classpath Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.2.patch, pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739287#action_12739287 ] Dmitriy V. Ryaboy commented on PIG-909: --- Daniel, not sure what you mean. Do you mean that the patch makes it necessary to have an external version of hadoop to build/run pig? That's not the case, as I wrapped the whole thing in an if -- external hadoop jars will only be used instead of the bundled hadoop.jar if HADOOP_HOME is defined (and valid). Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.2.patch, pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739297#action_12739297 ] Dmitriy V. Ryaboy commented on PIG-909: --- Actually, I looked at build.xml for Pig, and it includes the Ivy dependencies in pig.jar, which explains why this has been working for me. I'll delete the second patch -- that change is unnecessary. Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-909: -- Attachment: (was: pig_909.2.patch) Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-660: -- Attachment: pig_660_shims.patch The attached patch, pig_660_shims.patch, introduces a compatibility layer similar to that in https://issues.apache.org/jira/browse/HIVE-487 . HadoopShims.java contains wrappers that hide interface differences between Hadoop 18 and 20; when an interface change affects Pig, a shim is added to this class and used by Pig. Separate versions of the shims are maintained for different Hadoop versions. This way, Pig users can compile against either Hadoop 18 or Hadoop 20 by simply changing an ant property, either via the -D flag or build.properties, instead of having to go through the process of patching. There has been discussion of officially moving Pig to 0.20; this way, we sidestep the whole question, and only need to worry about version compatibility when using specific Hadoop APIs. I propose that we use this mechanism until Pig is moved to use the new, future-proofed API. Pig compiled against 18 won't be able to use some of the newest features, such as Zebra storage; Ant can be configured not to build Zebra when the Hadoop version is 18. Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch, pig_660_shims.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. 
The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
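One way to picture the shim mechanism is as a per-version source directory chosen at build time. The sketch below is hypothetical; the directory names, the `hadoopversion` property, and the version handling are made up for illustration, not taken from the patch:

```shell
#!/bin/sh
# Hypothetical illustration of version selection for a shim layer:
# one source tree per supported Hadoop version, picked by a single
# build property (e.g. ant -Dhadoopversion=18).
shim_source_dir() {
    case "$1" in
        18) echo src/shims/hadoop18 ;;
        20) echo src/shims/hadoop20 ;;
        *)  echo "unsupported Hadoop version: $1" >&2; return 1 ;;
    esac
}

# The build then compiles whichever HadoopShims.java lives in the
# selected directory, so core Pig code never branches on version.
shim_source_dir 20   # prints src/shims/hadoop20
```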
[jira] Created: (PIG-903) ILLUSTRATE fails on 'Distinct' operator
ILLUSTRATE fails on 'Distinct' operator --- Key: PIG-903 URL: https://issues.apache.org/jira/browse/PIG-903 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Using the latest Pig from trunk (0.3+) in mapreduce mode, running through the tutorial script script1-hadoop.pig works fine. However, executing the following illustrate command throws an exception: illustrate ngramed2
Pig Stack Trace --- ERROR 2999: Unexpected internal error. Unrecognized logical operator.
java.lang.RuntimeException: Unrecognized logical operator.
        at org.apache.pig.pen.EquivalenceClasses.GetEquivalenceClasses(EquivalenceClasses.java:60)
        at org.apache.pig.pen.DerivedDataVisitor.evaluateOperator(DerivedDataVisitor.java:368)
        at org.apache.pig.pen.DerivedDataVisitor.visit(DerivedDataVisitor.java:226)
        at org.apache.pig.impl.logicalLayer.LODistinct.visit(LODistinct.java:104)
        at org.apache.pig.impl.logicalLayer.LODistinct.visit(LODistinct.java:37)
        at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
        at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
        at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:98)
        at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:90)
        at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:106)
        at org.apache.pig.PigServer.getExamples(PigServer.java:724)
        at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:541)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:195)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:361)
This works: illustrate ngramed1; although it does throw a few NPEs:
java.lang.NullPointerException
        at org.apache.pig.pen.util.DisplayExamples.ShortenField(DisplayExamples.java:205)
        at org.apache.pig.pen.util.DisplayExamples.MakeArray(DisplayExamples.java:190)
        at org.apache.pig.pen.util.DisplayExamples.PrintTabular(DisplayExamples.java:86)
        [...]
(illustrate also doesn't work on bzipped input, but that's a separate issue) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-893) support cast of chararray to other simple types
[ https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735676#action_12735676 ] Dmitriy V. Ryaboy commented on PIG-893: --- +1 for string-numeric conversion via casting. support cast of chararray to other simple types --- Key: PIG-893 URL: https://issues.apache.org/jira/browse/PIG-893 Project: Pig Issue Type: New Feature Reporter: Thejas M Nair Pig should support casting of chararray to integer,long,float,double,bytearray. If the conversion fails for reasons such as overflow, cast should return null and log a warning. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-660: -- Attachment: PIG-660_5.patch Updating the patch to set PIG_HADOOP_VERSION to 20 by default. Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader
[ https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729751#action_12729751 ] Dmitriy V. Ryaboy commented on PIG-879: --- Having this be a global flag through properties wouldn't work for scripts that require both behaviors in different load statements. Maybe a boolean performPathConversion flag which is true by default, and can be overridden via the load statement? Custom Loaders could change what their default is. I think a boolean flag is more straightforward than a method you have to override with a no-op. Pig should provide a way for input location string in load statement to be passed as-is to the Loader - Key: PIG-879 URL: https://issues.apache.org/jira/browse/PIG-879 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Due to multiquery optimization, Pig always converts the filenames to absolute URIs (see http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section about Incompatible Changes - Path Names and Schemes). This is necessary since the script may have cd .. statements between load or store statements and if the load statements have relative paths, we would need to convert to absolute paths to know where to load/store from. To do this QueryParser.massageFilename() has the code below[1] which basically gives the fully qualified hdfs path However the issue with this approach is that if the filename string is something like hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2, the code below[1] actually translates this to hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2 and throws an exception that it is an incorrect path. Some loaders may want to interpret the filenames (the input location string in the load statement) in any way they wish and may want Pig to not make absolute paths out of them. 
There are a few options to address this: 1) A command-line switch to indicate to Pig that pathnames in the script are all absolute, so that Pig should not alter them and should pass them as-is to Loaders and Storers. 2) A keyword in the load and store statements to indicate the same intent to Pig. 3) A property which users can supply on the command line or in pig.properties to indicate the same intent. 4) A method in LoadFunc - relativeToAbsolutePath(String filename, String curDir) - which does the conversion to absolute; this way a Loader can choose to implement it as a no-op. Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-863) Function (UDF) automatic namespace resolution is really needed
[ https://issues.apache.org/jira/browse/PIG-863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723683#action_12723683 ] Dmitriy V. Ryaboy commented on PIG-863: --- I believe PIG-832 addresses this. Function (UDF) automatic namespace resolution is really needed -- Key: PIG-863 URL: https://issues.apache.org/jira/browse/PIG-863 Project: Pig Issue Type: Improvement Reporter: David Ciemiewicz The Apache PiggyBank documentation says that to reference a function, I need to specify a function as: org.apache.pig.piggybank.evaluation.string.UPPER(text) As in the example: {code} REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ; TweetsInaug = FILTER Tweets BY org.apache.pig.piggybank.evaluation.string.UPPER(text) MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ; {code} Why can't we implement automatic namespace resolution so we can just reference UPPER without namespace qualifiers? {code} REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ; TweetsInaug = FILTER Tweets BY UPPER(text) MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ; {code} I know about the workaround: {code} define org.apache.pig.piggybank.evaluation.string.UPPER UPPER {code} But this is really a pain to do if I have lots of functions. Just warn if there is a collision and suggest I use the define workaround in the warning messages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-855) Filter to determine if a UserAgent string is a bot
Filter to determine if a UserAgent string is a bot -- Key: PIG-855 URL: https://issues.apache.org/jira/browse/PIG-855 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Priority: Minor A PiggyBank contrib that would allow one to filter records by whether a UserAgent string represents a bot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-855) Filter to determine if a UserAgent string is a bot
[ https://issues.apache.org/jira/browse/PIG-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721012#action_12721012 ] Dmitriy V. Ryaboy commented on PIG-855: --- Jeff, the approach depends on whether you care more about false positives or false negatives. The right way to do this is probably not to write a boolean function, but something that returns one of several codes -- known browser, known crawler, monitor, stuff like wget and curl, and unknown. IAB has a standard list of bots and spiders (http://www.iab.net/sites/login.php), and maintains an industry standard for the filters that should be applied before numbers are reported. Filter to determine if a UserAgent string is a bot -- Key: PIG-855 URL: https://issues.apache.org/jira/browse/PIG-855 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Priority: Minor A PiggyBank contrib that would allow one to filter records by whether a UserAgent string represents a bot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
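The multi-code classification suggested in the comment could look something like the sketch below. The patterns are toy examples only; a production filter should be driven by the IAB bots-and-spiders list rather than hand-written globs:

```shell
#!/bin/sh
# Toy sketch: classify a UserAgent string into one of several
# categories instead of a yes/no bot flag. Patterns are illustrative.
classify_user_agent() {
    case "$1" in
        *Googlebot*|*Slurp*|*bingbot*) echo known_crawler ;;
        *Pingdom*|*UptimeRobot*)       echo monitor ;;
        *curl*|*Wget*)                 echo tool ;;
        *Mozilla*)                     echo known_browser ;;
        *)                             echo unknown ;;
    esac
}

# Order matters: many crawlers also claim "Mozilla", so the crawler
# patterns must be checked before the browser pattern.
classify_user_agent "Mozilla/5.0 (compatible; Googlebot/2.1)"   # prints known_crawler
```

A Pig UDF version of this would return the code as a chararray, letting scripts group or filter on whichever categories they care about.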
[jira] Updated: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2
[ https://issues.apache.org/jira/browse/PIG-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-830: -- Status: Patch Available (was: Open) Port Apache Log parsing piggybank contrib to Pig 0.2 Key: PIG-830 URL: https://issues.apache.org/jira/browse/PIG-830 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig-830-v2.patch, pig-830-v3.patch, pig-830.patch, TEST-org.apache.pig.piggybank.test.storage.TestMyRegExLoader.txt The piggybank contribs (pig-472, pig-473, pig-474, pig-476, pig-486, pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was merged in. They should be updated to work with the current APIs and added back into trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2
Port Apache Log parsing piggybank contrib to Pig 0.2 Key: PIG-830 URL: https://issues.apache.org/jira/browse/PIG-830 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Dmitriy V. Ryaboy Priority: Minor The piggybank contribs (pig-472, pig-473, pig-474, pig-476, pig-486, pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was merged in. They should be updated to work with the current APIs and added back into trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2
[ https://issues.apache.org/jira/browse/PIG-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-830: -- Attachment: pig-830-v2.patch Sorry about that. New version attached, passes the test this time. Port Apache Log parsing piggybank contrib to Pig 0.2 Key: PIG-830 URL: https://issues.apache.org/jira/browse/PIG-830 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig-830-v2.patch, pig-830.patch, TEST-org.apache.pig.piggybank.test.storage.TestMyRegExLoader.txt The piggybank contribs (pig-472, pig-473, pig-474, pig-476, pig-486, pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was merged in. They should be updated to work with the current APIs and added back into trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2
[ https://issues.apache.org/jira/browse/PIG-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-830: -- Status: Patch Available (was: Open) Port Apache Log parsing piggybank contrib to Pig 0.2 Key: PIG-830 URL: https://issues.apache.org/jira/browse/PIG-830 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig-830-v2.patch, pig-830.patch, TEST-org.apache.pig.piggybank.test.storage.TestMyRegExLoader.txt The piggybank contribs (pig-472, pig-473, pig-474, pig-476, pig-486, pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was merged in. They should be updated to work with the current APIs and added back into trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2
[ https://issues.apache.org/jira/browse/PIG-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-830: -- Attachment: pig-830-v3.patch As I experimented with these classes, I realized that the naive implementation, which used a regex to capture strings and returned a tuple of strings, is not appropriate for the typed version of Pig, since one may want to cast various fields to integers, etc. The attached version returns a tuple of DataByteArrays instead. Port Apache Log parsing piggybank contrib to Pig 0.2 Key: PIG-830 URL: https://issues.apache.org/jira/browse/PIG-830 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig-830-v2.patch, pig-830-v3.patch, pig-830.patch, TEST-org.apache.pig.piggybank.test.storage.TestMyRegExLoader.txt The piggybank contribs (pig-472, pig-473, pig-474, pig-476, pig-486, pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was merged in. They should be updated to work with the current APIs and added back into trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-825) PIG_HADOOP_VERSION should be 18
PIG_HADOOP_VERSION should be 18 --- Key: PIG-825 URL: https://issues.apache.org/jira/browse/PIG-825 Project: Pig Issue Type: Bug Components: grunt Reporter: Dmitriy V. Ryaboy PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now considered default. Patch coming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-825) PIG_HADOOP_VERSION should be 18
[ https://issues.apache.org/jira/browse/PIG-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-825: -- Attachment: pig-825.patch Attached trivial patch, please review. PIG_HADOOP_VERSION should be 18 --- Key: PIG-825 URL: https://issues.apache.org/jira/browse/PIG-825 Project: Pig Issue Type: Bug Components: grunt Reporter: Dmitriy V. Ryaboy Attachments: pig-825.patch PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now considered default. Patch coming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-825) PIG_HADOOP_VERSION should be 18
[ https://issues.apache.org/jira/browse/PIG-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-825: -- Attachment: pig-825.patch Minor update to minor patch -- fixed a typo in the bug number in CHANGES.txt PIG_HADOOP_VERSION should be 18 --- Key: PIG-825 URL: https://issues.apache.org/jira/browse/PIG-825 Project: Pig Issue Type: Bug Components: grunt Reporter: Dmitriy V. Ryaboy Attachments: pig-825.patch, pig-825.patch PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now considered default. Patch coming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.