Re: Storing statistics of input dataset

2012-08-06 Thread Bill Graham
There are a few open JIRAs that are related to refactoring the query plan
code to allow for stats-based runtime optimizations:

https://issues.apache.org/jira/browse/PIG-483
https://issues.apache.org/jira/browse/PIG-2784

If anyone has thoughts/opinions around suggested design changes, those
JIRAs could be a good place to chime in.


On Mon, Aug 6, 2012 at 5:18 PM, Dmitriy Ryaboy  wrote:

> + 1 to that.
>
> We can get stats from the Hive metadata catalog via HCat. Loaders can
> already implement the LoadStatistics interface -- and if HCatLoader
> does this, we can create them via Hive and use that team's great work.
> We should also allow stats to be passed (and modified appropriately)
> through the dag, and instrument intermediate data writers to collect
> stats and send telemetry back for improved flow planning, but that's a
> separate conversation.
>
> D
>
> On Mon, Aug 6, 2012 at 10:35 AM, Alan Gates  wrote:
> > Pig does not have a metadata store, so it doesn't store statistics on
> > data.  However, through HCatalog it will have access to the same
> > statistics that Hive stores.
> >
> > As far as using this data to optimize Pig operations, I'd like to rework
> > the backend to start taking advantage of such statistics when available
> > (either from metadata like this or statistics that are generated on the
> > fly as scripts are executed).  I also hope to share as much of this work
> > as possible with Hive so that both can benefit.
> >
> > Alan.
> >
> > On Aug 5, 2012, at 1:12 AM, Prasanth J wrote:
> >
> >> Hello everyone
> >>
> >> Came across this excellent post about storing column statistics in Hive
> >> http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/
> >>
> >> Does pig gather statistics similar to what hive does? I think gathering
> >> such statistics will be very helpful not only for cost based optimizer but
> >> in other cases like knowing the count of rows, knowing the histogram of
> >> underlying data etc.. In my case, I am working on cube computation for
> >> holistic measure where I need to know the count of rows, based on it I can
> >> load sample data set for determining the partition factor for large groups.
> >> I am sure gathering statistics and persisting it will help in other
> >> cases/optimizations as well.
> >>
> >> If I am right, pig doesn't use cost based estimation while optimizing
> >> the logical plan instead I believe it uses rules of thumb (Plz. correct me
> >> if I am wrong). Having statistics about the datasets would help to provide
> >> better optimization (similar to the join optimization in the blog post).
> >> Any thoughts about having such statistics in pig and implementing ANALYZE
> >> command for gathering statistics?
> >>
> >> Thanks
> >> -- Prasanth Jayachandran
> >>
> >
>



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgra...@gmail.com going forward.*


Re: dynamodb for pig

2012-08-06 Thread Bill Graham
The best place for code like this is GitHub, IMO. That allows the code to
be developed and released independently of Pig's or DynamoDB's release cycle.



On Mon, Aug 6, 2012 at 10:47 AM, Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com> wrote:

> Hi,
>
> These integration classes, do they go into the contribs? or where do
> they fit in Pig's project?
> Cesar, if you need any help on doing this, please just let me know. I
> am in the middle of finishing Gora's integration with Amazon DynamoDB,
> and we have plans to integrate Gora with Pig downstream as well (:
>
>
> Renato M.
>
> 2012/8/6 Bill Graham :
> > Hi Cesar,
> >
> > I'm not aware of any DynamoDB integration, but it looks like AWS has it on
> > their radar:
> > https://forums.aws.amazon.com/thread.jspa?messageID=337502
> >
> > If you implement something yourself, I recommend looking at the
> > HBaseStorage class in Pig to see an example of how to integrate with an
> > external DB.
> >
> > thanks,
> > Bill
> >
> > On Sat, Aug 4, 2012 at 11:26 AM, cesar romero  wrote:
> >
> >> I'm interested in being able to load and store into DynamoDB tables.
> >> I couldn't find a related JIRA issue[1], and I'm willing to work on it
> >> myself. Is there an issue I missed or can I just create a JIRA issue
> >> myself and start tracking this? Any pointers would be greatly
> >> appreciated as I have not contributed to an apache project in the
> >> past.
> >>
> >> Cesar
> >>
> >> [1]
> >> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=summary+~+dynamodb+OR+description+~+dynamodb+OR+comment+~+dynamodb
> >>
> >
> >
> >
> > --
> > *Note that I'm no longer using my Yahoo! email address. Please email me at
> > billgra...@gmail.com going forward.*
>





Re: Storing statistics of input dataset

2012-08-06 Thread Dmitriy Ryaboy
+ 1 to that.

We can get stats from the Hive metadata catalog via HCat. Loaders can
already implement the LoadStatistics interface -- and if HCatLoader
does this, we can create them via Hive and use that team's great work.
We should also allow stats to be passed (and modified appropriately)
through the dag, and instrument intermediate data writers to collect
stats and send telemetry back for improved flow planning, but that's a
separate conversation.
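
To make the idea concrete, here is a minimal sketch of a stats-aware planning decision, assuming the planner can ask for an optional size estimate of the smaller join input. Everything in it is hypothetical: `StatsAwareJoinChooser`, `Strategy`, and the byte limit are illustration names, not Pig APIs. It only shows the shape of the decision: fall back to the default plan when no statistics are available, and pick a replicated join when the smaller input is known to fit under a limit.

```java
import java.util.OptionalLong;

// Hypothetical planner helper; none of these names exist in Pig.
final class StatsAwareJoinChooser {
    enum Strategy { DEFAULT, REPLICATED }

    private StatsAwareJoinChooser() {}

    /**
     * Picks a join strategy from an optional size estimate.
     * An empty estimate means "no statistics", so the rule-of-thumb
     * default is kept rather than guessing.
     */
    static Strategy choose(OptionalLong smallerInputBytes, long replicatedLimitBytes) {
        if (!smallerInputBytes.isPresent()) {
            return Strategy.DEFAULT;
        }
        return smallerInputBytes.getAsLong() <= replicatedLimitBytes
                ? Strategy.REPLICATED
                : Strategy.DEFAULT;
    }
}
```

The point of the `OptionalLong` is that missing statistics stay distinguishable from a size of zero, so loaders that cannot report stats do not change existing behavior.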

D

On Mon, Aug 6, 2012 at 10:35 AM, Alan Gates  wrote:
> Pig does not have a metadata store, so it doesn't store statistics on data.  
> However, through HCatalog it will have access to the same statistics that 
> Hive stores.
>
> As far as using this data to optimize Pig operations, I'd like to rework the 
> backend to start taking advantage of such statistics when available (either 
> from metadata like this or statistics that are generated on the fly as 
> scripts are executed).  I also hope to share as much of this work as possible 
> with Hive so that both can benefit.
>
> Alan.
>
> On Aug 5, 2012, at 1:12 AM, Prasanth J wrote:
>
>> Hello everyone
>>
>> Came across this excellent post about storing column statistics in Hive 
>> http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/
>>
>> Does pig gather statistics similar to what hive does? I think gathering such 
>> statistics will be very helpful not only for cost based optimizer but in 
>> other cases like knowing the count of rows, knowing the histogram of 
>> underlying data etc.. In my case, I am working on cube computation for 
>> holistic measure where I need to know the count of rows, based on it I can 
>> load sample data set for determining the partition factor for large groups. 
>> I am sure gathering statistics and persisting it will help in other 
>> cases/optimizations as well.
>>
>> If I am right, pig doesn't use cost based estimation while optimizing the 
>> logical plan instead I believe it uses rules of thumb (Plz. correct me if I 
>> am wrong). Having statistics about the datasets would help to provide better 
>> optimization (similar to the join optimization in the blog post). Any 
>> thoughts about having such statistics in pig and implementing ANALYZE 
>> command for gathering statistics?
>>
>> Thanks
>> -- Prasanth Jayachandran
>>
>


[jira] [Updated] (PIG-2861) PlanHelper imports org.python.google.common.collect.Lists instead of org.google.common.collect.Lists

2012-08-06 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-2861:
--

Assignee: Jonathan Coveney
  Status: Patch Available  (was: Open)

> PlanHelper imports org.python.google.common.collect.Lists instead of 
> org.google.common.collect.Lists
> 
>
> Key: PIG-2861
> URL: https://issues.apache.org/jira/browse/PIG-2861
> Project: Pig
>  Issue Type: Bug
>Reporter: Jonathan Coveney
>Assignee: Jonathan Coveney
> Fix For: 0.11
>
> Attachments: PIG-2861-0.patch
>
>
> Fix is easy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (PIG-2861) PlanHelper imports org.python.google.common.collect.Lists instead of org.google.common.collect.Lists

2012-08-06 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-2861:
--

Attachment: PIG-2861-0.patch

> PlanHelper imports org.python.google.common.collect.Lists instead of 
> org.google.common.collect.Lists
> 
>
> Key: PIG-2861
> URL: https://issues.apache.org/jira/browse/PIG-2861
> Project: Pig
>  Issue Type: Bug
>Reporter: Jonathan Coveney
> Fix For: 0.11
>
> Attachments: PIG-2861-0.patch
>
>
> Fix is easy.





[jira] [Updated] (PIG-2861) PlanHelper imports org.python.google.common.collect.Lists instead of org.google.common.collect.Lists

2012-08-06 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-2861:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> PlanHelper imports org.python.google.common.collect.Lists instead of 
> org.google.common.collect.Lists
> 
>
> Key: PIG-2861
> URL: https://issues.apache.org/jira/browse/PIG-2861
> Project: Pig
>  Issue Type: Bug
>Reporter: Jonathan Coveney
>Assignee: Jonathan Coveney
> Fix For: 0.11
>
> Attachments: PIG-2861-0.patch
>
>
> Fix is easy.





[jira] [Updated] (PIG-2861) PlanHelper imports org.python.google.common.collect.Lists instead of org.google.common.collect.Lists

2012-08-06 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-2861:
--

Summary: PlanHelper imports org.python.google.common.collect.Lists instead 
of org.google.common.collect.Lists  (was: Imported 
org.python.google.common.collect.Lists instead of 
org.google.common.collect.Lists)

> PlanHelper imports org.python.google.common.collect.Lists instead of 
> org.google.common.collect.Lists
> 
>
> Key: PIG-2861
> URL: https://issues.apache.org/jira/browse/PIG-2861
> Project: Pig
>  Issue Type: Bug
>Reporter: Jonathan Coveney
> Fix For: 0.11
>
>
> Fix is easy.





[jira] [Created] (PIG-2861) Imported org.python.google.common.collect.Lists instead of org.google.common.collect.Lists

2012-08-06 Thread Jonathan Coveney (JIRA)
Jonathan Coveney created PIG-2861:
-

 Summary: Imported org.python.google.common.collect.Lists instead 
of org.google.common.collect.Lists
 Key: PIG-2861
 URL: https://issues.apache.org/jira/browse/PIG-2861
 Project: Pig
  Issue Type: Bug
Reporter: Jonathan Coveney
 Fix For: 0.11


Fix is easy.





[jira] [Updated] (PIG-2860) TestAvroStorageUtils.testGetConcretePathFromGlob fails on some version of hadoop

2012-08-06 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-2860:
---

Status: Patch Available  (was: Open)

> TestAvroStorageUtils.testGetConcretePathFromGlob fails on some version of 
> hadoop
> 
>
> Key: PIG-2860
> URL: https://issues.apache.org/jira/browse/PIG-2860
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.10.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.11
>
> Attachments: PIG-2860.patch
>
>
> I found that TestAvroStorageUtils.testGetConcretePathFromGlob fails on some 
> versions of hadoop (not ones that upstream Pig is currently using) with the 
> following error:
> {code}
> Call From localhost.localdomain/127.0.0.1 to localhost.localdomain:55883 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
> java.net.ConnectException: Call From localhost.localdomain/127.0.0.1 to 
> localhost.localdomain:55883 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:722)
> at org.apache.hadoop.ipc.Client.call(Client.java:1164)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:195)
> at $Proxy12.getFileInfo(Unknown Source)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
> at $Proxy12.getFileInfo(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:613)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1399)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:740)
> at org.apache.hadoop.fs.FileSystem.getFileStatus(FileSystem.java:2083)
> at 
> org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1547)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1488)
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.getConcretePathFromGlob(AvroStorageUtils.java:146)
> at 
> org.apache.pig.piggybank.test.storage.avro.TestAvroStorageUtils.testGetConcretePathFromGlob(TestAvroStorageUtils.java:142)
> {code}
> The fix is to explicitly add the URI scheme "file://" to the path that is 
> used in the test.





[jira] [Commented] (PIG-2860) TestAvroStorageUtils.testGetConcretePathFromGlob fails on some version of hadoop

2012-08-06 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429586#comment-13429586
 ] 

Cheolsoo Park commented on PIG-2860:


Review board:
https://reviews.apache.org/r/6412/

> TestAvroStorageUtils.testGetConcretePathFromGlob fails on some version of 
> hadoop
> 
>
> Key: PIG-2860
> URL: https://issues.apache.org/jira/browse/PIG-2860
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.10.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.11
>
> Attachments: PIG-2860.patch
>
>
> I found that TestAvroStorageUtils.testGetConcretePathFromGlob fails on some 
> versions of hadoop (not ones that upstream Pig is currently using) with the 
> following error:
> {code}
> Call From localhost.localdomain/127.0.0.1 to localhost.localdomain:55883 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
> java.net.ConnectException: Call From localhost.localdomain/127.0.0.1 to 
> localhost.localdomain:55883 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:722)
> at org.apache.hadoop.ipc.Client.call(Client.java:1164)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:195)
> at $Proxy12.getFileInfo(Unknown Source)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
> at $Proxy12.getFileInfo(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:613)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1399)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:740)
> at org.apache.hadoop.fs.FileSystem.getFileStatus(FileSystem.java:2083)
> at 
> org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1547)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1488)
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.getConcretePathFromGlob(AvroStorageUtils.java:146)
> at 
> org.apache.pig.piggybank.test.storage.avro.TestAvroStorageUtils.testGetConcretePathFromGlob(TestAvroStorageUtils.java:142)
> {code}
> The fix is to explicitly add the URI scheme "file://" to the path that is 
> used in the test.





[jira] [Updated] (PIG-2860) TestAvroStorageUtils.testGetConcretePathFromGlob fails on some version of hadoop

2012-08-06 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-2860:
---

Attachment: PIG-2860.patch

> TestAvroStorageUtils.testGetConcretePathFromGlob fails on some version of 
> hadoop
> 
>
> Key: PIG-2860
> URL: https://issues.apache.org/jira/browse/PIG-2860
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.10.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.11
>
> Attachments: PIG-2860.patch
>
>
> I found that TestAvroStorageUtils.testGetConcretePathFromGlob fails on some 
> versions of hadoop (not ones that upstream Pig is currently using) with the 
> following error:
> {code}
> Call From localhost.localdomain/127.0.0.1 to localhost.localdomain:55883 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
> java.net.ConnectException: Call From localhost.localdomain/127.0.0.1 to 
> localhost.localdomain:55883 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:722)
> at org.apache.hadoop.ipc.Client.call(Client.java:1164)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:195)
> at $Proxy12.getFileInfo(Unknown Source)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
> at $Proxy12.getFileInfo(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:613)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1399)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:740)
> at org.apache.hadoop.fs.FileSystem.getFileStatus(FileSystem.java:2083)
> at 
> org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1547)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1488)
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.getConcretePathFromGlob(AvroStorageUtils.java:146)
> at 
> org.apache.pig.piggybank.test.storage.avro.TestAvroStorageUtils.testGetConcretePathFromGlob(TestAvroStorageUtils.java:142)
> {code}
> The fix is to explicitly add the URI scheme "file://" to the path that is 
> used in the test.





[jira] [Created] (PIG-2860) TestAvroStorageUtils.testGetConcretePathFromGlob fails on some version of hadoop

2012-08-06 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-2860:
--

 Summary: TestAvroStorageUtils.testGetConcretePathFromGlob fails on 
some version of hadoop
 Key: PIG-2860
 URL: https://issues.apache.org/jira/browse/PIG-2860
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.10.0
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.11


I found that TestAvroStorageUtils.testGetConcretePathFromGlob fails on some 
versions of hadoop (not ones that upstream Pig is currently using) with the 
following error:

{code}
Call From localhost.localdomain/127.0.0.1 to localhost.localdomain:55883 failed 
on connection exception: java.net.ConnectException: Connection refused; For 
more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
java.net.ConnectException: Call From localhost.localdomain/127.0.0.1 to 
localhost.localdomain:55883 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:722)
at org.apache.hadoop.ipc.Client.call(Client.java:1164)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:195)
at $Proxy12.getFileInfo(Unknown Source)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at $Proxy12.getFileInfo(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:613)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1399)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:740)
at org.apache.hadoop.fs.FileSystem.getFileStatus(FileSystem.java:2083)
at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1547)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1488)
at 
org.apache.pig.piggybank.storage.avro.AvroStorageUtils.getConcretePathFromGlob(AvroStorageUtils.java:146)
at 
org.apache.pig.piggybank.test.storage.avro.TestAvroStorageUtils.testGetConcretePathFromGlob(TestAvroStorageUtils.java:142)
{code}

The fix is to explicitly add the URI scheme "file://" to the path that is used 
in the test.
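
A minimal sketch of why the explicit scheme matters, assuming the failing test passed a scheme-less path that the newer Hadoop version resolved against the default (HDFS) file system rather than the local one. It uses only `java.net.URI`; `LocalPathQualifier` and `qualifyLocal` are hypothetical names and are not taken from PIG-2860.patch.

```java
import java.net.URI;

// Hypothetical helper, not part of the attached patch.
final class LocalPathQualifier {
    private LocalPathQualifier() {}

    /**
     * Returns the path unchanged if it already carries a URI scheme;
     * otherwise prefixes "file://" so it cannot be resolved against
     * whatever default file system the configuration names.
     */
    static String qualifyLocal(String path) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return path; // e.g. hdfs://... or file://... stays as-is
        }
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException("expected an absolute path: " + path);
        }
        return "file://" + path;
    }
}
```

For example, `qualifyLocal("/tmp/avro/part-*.avro")` yields `file:///tmp/avro/part-*.avro`, while a path that already starts with `hdfs://` is returned untouched.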





[jira] [Commented] (PIG-2837) AvroStorage throws StackOverFlowError

2012-08-06 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429564#comment-13429564
 ] 

Cheolsoo Park commented on PIG-2837:


Actually, I have clean runs on Mac with both hadoop 20 and 23 using the 
following commands:

{code:title=hadoop 20}
ant clean compile-test jar-withouthadoop -Dhadoopversion=20
cd contrib/piggybank/java
ant clean test -Dhadoopversion=20
{code}

{code:title=hadoop 23}
ant clean compile-test jar-withouthadoop -Dhadoopversion=23
cd contrib/piggybank/java
JAVA_HOME=`/usr/libexec/java_home` ant clean test -Dhadoopversion=23
{code}

Thanks!

> AvroStorage throws StackOverFlowError
> -
>
> Key: PIG-2837
> URL: https://issues.apache.org/jira/browse/PIG-2837
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.10.0
>Reporter: Mubarak Seyed
>Assignee: Cheolsoo Park
> Fix For: 0.11
>
> Attachments: PIG-2837-2.patch, PIG-2837.patch, avro_test_files.tar.gz
>
>
> When I try to dump Avro data using
> {code}
> records = LOAD '/logs/records/07262012/01/1/Record.1343265732700.avro' using 
> org.apache.pig.piggybank.storage.avro.AvroStorage(); 
> dump records;
> {code}
> {code}
> Pig Stack Trace 
> --- 
> ERROR 2998: Unhandled internal error. null
> java.lang.StackOverflowError 
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:258)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
>  
> at 
> org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
> {code}
> I verified the Avro schema using avro-tools and dumped the data as JSON; 
> the data looks good.
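
The repeating 262/271/284 frames in the trace are consistent with a recursive walk that never terminates. A minimal sketch of one common guard follows, assuming (this is not confirmed by the report) that the overflow comes from a self-referencing schema. `ToySchema` is a toy stand-in and is NOT Avro's `Schema` class; `SchemaWalker.containsKind` is a hypothetical name and is not the code in `AvroStorageUtils`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy stand-in for a schema node; not Avro's Schema class.
final class ToySchema {
    enum Kind { RECORD, ARRAY, UNION }

    final Kind kind;
    final List<ToySchema> children = new ArrayList<ToySchema>();

    ToySchema(Kind kind) {
        this.kind = kind;
    }
}

// Hypothetical walker that tolerates cycles in the schema graph.
final class SchemaWalker {
    private SchemaWalker() {}

    static boolean containsKind(ToySchema root, ToySchema.Kind target) {
        // Identity-based set: two distinct nodes are never confused,
        // and a node reached twice is only explored once.
        Set<ToySchema> visited =
                Collections.newSetFromMap(new IdentityHashMap<ToySchema, Boolean>());
        return containsKind(root, target, visited);
    }

    private static boolean containsKind(ToySchema node, ToySchema.Kind target,
                                        Set<ToySchema> visited) {
        if (!visited.add(node)) {
            return false; // already explored: stops infinite recursion on cycles
        }
        if (node.kind == target) {
            return true;
        }
        for (ToySchema child : node.children) {
            if (containsKind(child, target, visited)) {
                return true;
            }
        }
        return false;
    }
}
```

Without the `visited` set, a record whose field eventually refers back to the record itself would recurse until the stack is exhausted, which matches the symptom above.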





[jira] [Commented] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure

2012-08-06 Thread Jakob Homan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429456#comment-13429456
 ] 

Jakob Homan commented on PIG-1891:
--

This looks good to me.  +1 on the patch, for what it's worth.  This is what 
we're looking for.  [~billgraham], how does this look to you?

> Enable StoreFunc to make intelligent decision based on job success or failure
> -
>
> Key: PIG-1891
> URL: https://issues.apache.org/jira/browse/PIG-1891
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.10.0
>Reporter: Alex Rovner
>Priority: Minor
>  Labels: patch
> Attachments: PIG-1891-1.patch
>
>
> We are in the process of using PIG for various data processing and component 
> integration. Here is where we feel pig storage funcs lack:
> They are not aware of whether the overall job has succeeded. This creates a problem 
> for storage funcs that need to "upload" results into another system:
> DB, FTP, another file system etc.
> I looked at the DBStorage in the piggybank 
> (http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java?view=markup)
>  and what I see is essentially a mechanism which for each task does the 
> following:
> 1. Creates a recordwriter (in this case open connection to db)
> 2. Open transaction.
> 3. Writes records into a batch
> 4. Executes commit or rollback depending if the task was successful.
> While this approach works great at the task level, it does not work at all at the 
> job level. 
> If some tasks succeed but the overall job fails, partial records are 
> going to get uploaded into the DB.
> Any ideas on the workaround? 
> Our current workaround is fairly ugly: We created a java wrapper that 
> launches pig jobs and then uploads to DB's once pig's job is successful. 
> While the approach works, it's not really integrated into pig.
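
The job-level behavior described above can be sketched as follows: each task stages its records, and nothing reaches the external system until the whole job is known to have succeeded. This is a minimal in-memory illustration under stated assumptions, not the attached patch and not a Pig API; `JobLevelUploader`, `stageTaskOutput`, and `onJobFinished` are hypothetical names, and the "external system" is just a list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of job-level commit; not a Pig StoreFunc.
final class JobLevelUploader {
    // Per-task staged batches, kept in task arrival order.
    private final Map<String, List<String>> staged =
            new LinkedHashMap<String, List<String>>();
    // Stand-in for the external system (DB, FTP, ...).
    private final List<String> uploaded = new ArrayList<String>();

    /** Called once per successful task; records are held back, not uploaded. */
    void stageTaskOutput(String taskId, List<String> records) {
        staged.put(taskId, new ArrayList<String>(records));
    }

    /** Called once when the job ends; uploads everything or nothing. */
    void onJobFinished(boolean jobSucceeded) {
        if (jobSucceeded) {
            for (List<String> batch : staged.values()) {
                uploaded.addAll(batch);
            }
        }
        staged.clear(); // on failure the staged batches are simply dropped
    }

    List<String> uploadedRecords() {
        return new ArrayList<String>(uploaded);
    }
}
```

The difference from per-task commit is only where the decision is made: a task finishing successfully stages data, and the single job-level callback decides whether any of it becomes visible.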





[jira] [Commented] (PIG-2319) Pig should support snappy as a value for pig.tmpfilecompression.codec

2012-08-06 Thread Rakesh Kothari (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429320#comment-13429320
 ] 

Rakesh Kothari commented on PIG-2319:
-

Any updates on this ?

> Pig should support snappy as a value for pig.tmpfilecompression.codec
> -
>
> Key: PIG-2319
> URL: https://issues.apache.org/jira/browse/PIG-2319
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.1, 0.9.1
>Reporter: Joe Crobak
>
> Utils.tmpFileCompressionCodec() hard-codes support for only "gz" and "lzo" 
> compression.  Since support for snappy was added in HADOOP-7206, it would be 
> nice to allow this codec as well.
> A future-proof solution to this problem might let the user provide a full 
> classname (like in the hadoop settings) or the short-hand, in case the 
> short-hand doesn't exist for a given codec.
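
A minimal sketch of the "short-hand or full classname" lookup suggested above. It is an assumption-laden illustration, not the code in `Utils.tmpFileCompressionCodec()`: `TmpCodecResolver` and `resolve` are hypothetical names, the codec class names are given as plain strings as I understand them (the LZO one lives in the separate hadoop-lzo project), and nothing here loads or verifies the classes.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical resolver; returns a class name string, loads nothing.
final class TmpCodecResolver {
    private static final Map<String, String> SHORT_HANDS = new HashMap<String, String>();
    static {
        SHORT_HANDS.put("gz", "org.apache.hadoop.io.compress.GzipCodec");
        SHORT_HANDS.put("lzo", "com.hadoop.compression.lzo.LzoCodec");
        SHORT_HANDS.put("snappy", "org.apache.hadoop.io.compress.SnappyCodec");
    }

    private TmpCodecResolver() {}

    /**
     * Maps a configured value to a codec class name: known short-hands
     * are translated, anything containing a dot is treated as a fully
     * qualified class name, and everything else is rejected.
     */
    static String resolve(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            throw new IllegalArgumentException("codec must not be empty");
        }
        String value = configured.trim();
        String known = SHORT_HANDS.get(value.toLowerCase(Locale.ROOT));
        if (known != null) {
            return known;
        }
        if (value.indexOf('.') >= 0) {
            return value; // assumed to be a full classname supplied by the user
        }
        throw new IllegalArgumentException("unknown codec short-hand: " + value);
    }
}
```

With this shape, adding a new short-hand is a one-line change, and a codec without a short-hand can still be used by giving its full classname.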





Re: dynamodb for pig

2012-08-06 Thread Renato Marroquín Mogrovejo
Hi,

Do these integration classes go into contrib, or where do they fit in
the Pig project?
Cesar, if you need any help with this, please just let me know. I
am in the middle of finishing Gora's integration with Amazon DynamoDB,
and we have plans to integrate Gora with Pig downstream as well (:


Renato M.

2012/8/6 Bill Graham :
> Hi Cesar,
>
> I'm not aware of any DynamoDB integration, but it looks like AWS has it on
> their radar:
> https://forums.aws.amazon.com/thread.jspa?messageID=337502
>
> If you implement something yourself, I recommend looking at the
> HBaseStorage class in Pig to see an example of how to integrate with an
> external DB.
>
> thanks,
> Bill
>
> On Sat, Aug 4, 2012 at 11:26 AM, cesar romero  wrote:
>
>> I'm interested in being able to load and store into DynamoDB tables.
>> I couldn't find a related JIRA issue[1], and I'm willing to work on it
>> myself. Is there an issue I missed or can I just create a JIRA issue
>> myself and start tracking this? Any pointers would be greatly
>> appreciated as I have not contributed to an apache project in the
>> past.
>>
>> Cesar
>>
>> [1]
>> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=summary+~+dynamodb+OR+description+~+dynamodb+OR+comment+~+dynamodb
>>
>
>
>
> --
> *Note that I'm no longer using my Yahoo! email address. Please email me at
> billgra...@gmail.com going forward.*


Re: Storing statistics of input dataset

2012-08-06 Thread Alan Gates
Pig does not have a metadata store, so it doesn't store statistics on data.  
However, through HCatalog it will have access to the same statistics that Hive 
stores.  

As far as using this data to optimize Pig operations, I'd like to rework the 
backend to start taking advantage of such statistics when available (either 
from metadata like this or statistics that are generated on the fly as scripts 
are executed).  I also hope to share as much of this work as possible with Hive 
so that both can benefit.

Alan.

On Aug 5, 2012, at 1:12 AM, Prasanth J wrote:

> Hello everyone
> 
> Came across this excellent post about storing column statistics in Hive 
> http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/
> 
> Does Pig gather statistics similar to what Hive does? I think gathering such 
> statistics would be very helpful not only for a cost-based optimizer but in 
> other cases, like knowing the row count or the histogram of the 
> underlying data. In my case, I am working on cube computation for 
> holistic measures, where I need to know the row count; based on it I can 
> load a sample data set to determine the partition factor for large groups. I 
> am sure gathering statistics and persisting them will help in other 
> cases/optimizations as well.
> 
> If I am right, Pig doesn't use cost-based estimation while optimizing the 
> logical plan; instead, I believe it uses rules of thumb (please correct me if I 
> am wrong). Having statistics about the datasets would help provide better 
> optimization (similar to the join optimization in the blog post). Any 
> thoughts about having such statistics in Pig and implementing an ANALYZE command 
> for gathering statistics?
> 
> Thanks
> -- Prasanth Jayachandran
> 
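
To make the "statistics when available" idea concrete, here is a minimal sketch of a statistics-driven planning rule. The thread does not specify any design, so everything here is an assumption: JoinPlanner, choose, and the byte threshold are hypothetical, and the rule simply prefers a replicated (map-side) join when the smaller input is known to be small.

```java
// Hypothetical sketch of a statistics-driven planning rule; not Pig's optimizer.
final class JoinPlanner {
    enum Strategy { REPLICATED, DEFAULT }

    /**
     * Picks a join strategy from input sizes in bytes.
     * A negative size means "no statistics available".
     */
    static Strategy choose(long leftBytes, long rightBytes, long replicatedLimitBytes) {
        if (leftBytes < 0 || rightBytes < 0) {
            return Strategy.DEFAULT; // no stats: keep the rule-of-thumb plan
        }
        long smaller = Math.min(leftBytes, rightBytes);
        return smaller <= replicatedLimitBytes ? Strategy.REPLICATED : Strategy.DEFAULT;
    }
}
```

The fallback branch matters: when no statistics exist, the planner behaves exactly as it does today.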



[jira] [Commented] (PIG-2837) AvroStorage throws StackOverFlowError

2012-08-06 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429274#comment-13429274
 ] 

Cheolsoo Park commented on PIG-2837:


@Santhosh,

Thanks for reviewing my patch!

Regarding the other test failures,

1) If I run the following commands, I don't see TestDBStorage failure in Hadoop 
20:
{code}
ant clean compile-test jar-withouthadoop -Dhadoopversion=20  # Note: jar-withouthadoop
cd contrib/piggybank/java
ant clean test -Dtestcase=TestAvroStorage -Dhadoopversion=20
{code}

2) I believe that TestDBStorage/TestMultiStorage/TestLookupInFiles failures are 
due to some mini cluster mis-configuration:
{code}
2012-08-06 01:13:52,340 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:transition(533)) - Application application_1344240816627_0001 failed 1 times due to AM Container for appattempt_1344240816627_0001_01 exited with  exitCode: 127 due to:
{code}

While I see them failing on Mac, they run fine on CentOS 6. I am wondering if 
it would be better to modify these tests so that they will run in local mode 
instead of mr mode.

Thanks! 

> AvroStorage throws StackOverFlowError
> -
>
> Key: PIG-2837
> URL: https://issues.apache.org/jira/browse/PIG-2837
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.10.0
>Reporter: Mubarak Seyed
>Assignee: Cheolsoo Park
> Fix For: 0.11
>
> Attachments: PIG-2837-2.patch, PIG-2837.patch, avro_test_files.tar.gz
>
>
> When I try to dump Avro data using
> {code}
> records = LOAD '/logs/records/07262012/01/1/Record.1343265732700.avro' using 
> org.apache.pig.piggybank.storage.avro.AvroStorage(); 
> dump records;
> {code}
> {code}
> Pig Stack Trace 
> --- 
> ERROR 2998: Unhandled internal error. null
> java.lang.StackOverflowError 
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:258)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:262)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:271)
> at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.containsGenericUnion(AvroStorageUtils.java:284)
> {code}
> I verified the Avro schema using avro-tools and dumped the data in JSON 
> format; the data looks good.
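
The trace shows containsGenericUnion calling itself over and over. One plausible cause, which is an assumption since the issue text does not state it, is a schema that refers back to itself, so that a plain recursive walk never terminates. The sketch below shows a common guard for that situation: remember the nodes already visited. SchemaNode and SchemaWalk are hypothetical stand-ins, not Avro's Schema class and not necessarily what the committed patch does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a schema node; this is not Avro's Schema class.
final class SchemaNode {
    final boolean genericUnion;
    final List<SchemaNode> children = new ArrayList<>();

    SchemaNode(boolean genericUnion) {
        this.genericUnion = genericUnion;
    }
}

final class SchemaWalk {
    /** Returns true if any node reachable from root is flagged as a generic union. */
    static boolean containsGenericUnion(SchemaNode root) {
        // Identity-based set: two distinct nodes are never confused with each other.
        Set<SchemaNode> visited =
                Collections.newSetFromMap(new IdentityHashMap<SchemaNode, Boolean>());
        return walk(root, visited);
    }

    private static boolean walk(SchemaNode node, Set<SchemaNode> visited) {
        if (!visited.add(node)) {
            return false; // already seen: stop here instead of recursing forever
        }
        if (node.genericUnion) {
            return true;
        }
        for (SchemaNode child : node.children) {
            if (walk(child, visited)) {
                return true;
            }
        }
        return false;
    }
}
```

The visited set bounds the walk by the number of distinct nodes, so a cyclic schema costs one extra lookup per back-reference instead of unbounded stack depth.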


Re: dynamodb for pig

2012-08-06 Thread Bill Graham
Hi Cesar,

I'm not aware of any DynamoDB integration, but it looks like AWS has it on
their radar:
https://forums.aws.amazon.com/thread.jspa?messageID=337502

If you implement something yourself, I recommend looking at the
HBaseStorage class in Pig to see an example of how to integrate with an
external DB.

thanks,
Bill

On Sat, Aug 4, 2012 at 11:26 AM, cesar romero  wrote:

> I'm interested in being able to load and store into DynamoDB tables.
> I couldn't find a related JIRA issue[1], and I'm willing to work on it
> myself. Is there an issue I missed or can I just create a JIRA issue
> myself and start tracking this? Any pointers would be greatly
> appreciated as I have not contributed to an apache project in the
> past.
>
> Cesar
>
> [1]
> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=summary+~+dynamodb+OR+description+~+dynamodb+OR+comment+~+dynamodb
>



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgra...@gmail.com going forward.*


dynamodb for pig

2012-08-06 Thread cesar romero
I'm interested in being able to load and store into DynamoDB tables.
I couldn't find a related JIRA issue[1], and I'm willing to work on it
myself. Is there an issue I missed or can I just create a JIRA issue
myself and start tracking this? Any pointers would be greatly
appreciated as I have not contributed to an apache project in the
past.

Cesar

[1] 
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=summary+~+dynamodb+OR+description+~+dynamodb+OR+comment+~+dynamodb


Build failed in Jenkins: Pig-trunk #1291

2012-08-06 Thread Apache Jenkins Server
See 

Changes:

[sms] PIG-2837: AvroStorage throws StackOverFlowError (cheolsoo via sms)

[sms] PIG-2856: AvroStorage doesn't load files in the directories when a glob 
pattern matches both files and directories. (cheolsoo via sms)

--
[...truncated 38413 lines...]
[junit] at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown(DataNode.java:788)
[junit] at org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNodes(MiniDFSCluster.java:566)
[junit] at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:550)
[junit] at org.apache.pig.test.MiniGenericCluster.shutdownMiniDfsClusters(MiniGenericCluster.java:87)
[junit] at org.apache.pig.test.MiniGenericCluster.shutdownMiniDfsAndMrClusters(MiniGenericCluster.java:77)
[junit] at org.apache.pig.test.MiniGenericCluster.shutDown(MiniGenericCluster.java:68)
[junit] at org.apache.pig.test.TestStore.oneTimeTearDown(TestStore.java:129)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
[junit] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
[junit] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
[junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:37)
[junit] at org.junit.runners.ParentRunner.run(ParentRunner.java:220)
[junit] at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:420)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:911)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:768)
[junit] 12/08/06 10:34:44 WARN datanode.FSDatasetAsyncDiskService: AsyncDiskService has already shut down.
[junit] Shutting down DataNode 2
[junit] 12/08/06 10:34:44 INFO mortbay.log: Stopped SelectChannelConnector@localhost:0
[junit] 12/08/06 10:34:45 INFO ipc.Server: Stopping server on 45679
[junit] 12/08/06 10:34:45 INFO ipc.Server: IPC Server handler 0 on 45679: exiting
[junit] 12/08/06 10:34:45 INFO ipc.Server: IPC Server handler 2 on 45679: exiting
[junit] 12/08/06 10:34:45 INFO metrics.RpcInstrumentation: shut down
[junit] 12/08/06 10:34:45 INFO ipc.Server: Stopping IPC Server listener on 45679
[junit] 12/08/06 10:34:45 INFO ipc.Server: Stopping IPC Server Responder
[junit] 12/08/06 10:34:45 WARN datanode.DataNode: DatanodeRegistration(127.0.0.1:58002, storageID=DS-119781900-67.195.138.20-58002-1344248791448, infoPort=47138, ipcPort=45679):DataXceiveServer:java.nio.channels.AsynchronousCloseException
[junit] at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:185)
[junit] at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:159)
[junit] at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:84)
[junit] at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:131)
[junit] at java.lang.Thread.run(Thread.java:662)
[junit] 
[junit] 12/08/06 10:34:45 INFO ipc.Server: IPC Server handler 1 on 45679: exiting
[junit] 12/08/06 10:34:45 INFO datanode.DataNode: Exiting DataXceiveServer
[junit] 12/08/06 10:34:45 INFO datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
[junit] 12/08/06 10:34:45 INFO datanode.DataBlockScanner: Exiting DataBlockScanner thread.
[junit] 12/08/06 10:34:45 INFO datanode.DataNode: DatanodeRegistration(127.0.0.1:58002, storageID=DS-119781900-67.195.138.20-58002-1344248791448, infoPort=47138, ipcPort=45679):Finishing DataNode in: FSDataset{dirpath='
[junit] 12/08/06 10:34:45 INFO ipc.Server: Stopping server on 45679
[junit] 12/08/06 10:34:45 INFO metrics.RpcInstrumentation: shut down
[junit] 12/08/06 10:34:45 INFO datanode.DataNode: Waiting for threadgroup to exit, active threads is 0
[junit] 12/08/06 10:34:45 INFO datanode.FSDatasetAsyncDiskService: Shutting down all async disk service threads...
[junit] 12/08/06 10:34:45 INFO datanode.FSDatasetAsyncDiskService: All async d

[jira] [Updated] (PIG-2837) AvroStorage throws StackOverFlowError

2012-08-06 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-2837:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)





[jira] [Updated] (PIG-2837) AvroStorage throws StackOverFlowError

2012-08-06 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-2837:
-

Fix Version/s: 0.11





[jira] [Commented] (PIG-2837) AvroStorage throws StackOverFlowError

2012-08-06 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429011#comment-13429011
 ] 

Santhosh Srinivasan commented on PIG-2837:
--

Patch has been committed. TestDBStorage (Hadoop 20 and Hadoop 23) and 
TestMultiStorage (Hadoop 23) are failing, and TestLookupInFiles is erroring 
out on Hadoop 23. All of these are unrelated to this patch.

Thanks Cheolsoo!





[jira] [Commented] (PIG-2856) AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

2012-08-06 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429008#comment-13429008
 ] 

Santhosh Srinivasan commented on PIG-2856:
--

Forgot to add that TestLookupInFiles in Hadoop 23 is erroring out.

> AvroStorage doesn't load files in the directories when a glob pattern matches 
> both files and directories.
> -
>
> Key: PIG-2856
> URL: https://issues.apache.org/jira/browse/PIG-2856
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.11
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.11
>
> Attachments: PIG-2856-2.patch, PIG-2856.patch
>
>
> This is a regression from PIG-2492.
> When a glob pattern such as '*' matches not only files but also directories, 
> AvroStorage does not load files in the directories. This is a bug in 
> getAllSubDirs() that can be fixed as follows:
> {code}
> static boolean getAllSubDirs(Path path, Job job, Set paths)
> ...
>     FileStatus[] matchedFiles = fs.globStatus(path, PATH_FILTER);
> ...
>     for (FileStatus file : matchedFiles) {
>         if (file.isDir()) {
> -            for (FileStatus sub : fs.listStatus(path)) {
> +            for (FileStatus sub : fs.listStatus(file.getPath())) {
>                 getAllSubDirs(sub.getPath(), job, paths);
>             }
>         }
>     }
> {code}
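
The essence of the fix is to list the directory that was matched, not the original glob pattern. The sketch below illustrates the same idea on the local filesystem with java.nio.file, as an analogue only: GlobWalk and collect are hypothetical names, and none of this is the Hadoop FileSystem API or the AvroStorage code.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

// Local-filesystem analogue using java.nio.file; not the Hadoop FileSystem API.
final class GlobWalk {
    /** Collects regular files matched by the glob in dir, descending into matched directories. */
    static Set<Path> collect(Path dir, String glob) throws IOException {
        Set<Path> files = new TreeSet<>();
        try (DirectoryStream<Path> matches = Files.newDirectoryStream(dir, glob)) {
            for (Path match : matches) {
                if (Files.isDirectory(match)) {
                    // List the matched directory itself, not the original pattern path.
                    files.addAll(collect(match, "*"));
                } else {
                    files.add(match);
                }
            }
        }
        return files;
    }
}
```

Calling collect on a directory that holds both a.avro and sub/b.avro with the glob "*" returns both files, which is the behavior the issue says was lost.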





[jira] [Commented] (PIG-2856) AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

2012-08-06 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429009#comment-13429009
 ] 

Santhosh Srinivasan commented on PIG-2856:
--

Addendum to previous comment: it's unrelated to this patch and existed 
previously.
