[
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13580386#comment-13580386
]
Russell Jurney commented on PIG-3015:
-------------------------------------
Loading data without going through Piggybank is amazing. However, TrevniStorage
fails to store my emails. The schema is below; you can reproduce this data with
your own Gmail emails (just a few are needed) by following these instructions:
https://github.com/rjurney/Agile_Data_Code/tree/master/ch03
grunt> describe emails
emails: {
    message_id: chararray,
    thread_id: chararray,
    in_reply_to: chararray,
    subject: chararray,
    body: chararray,
    date: chararray,
    from: (real_name: chararray, address: chararray),
    tos: {to: (real_name: chararray, address: chararray)},
    ccs: {cc: (real_name: chararray, address: chararray)},
    bccs: {bcc: (real_name: chararray, address: chararray)},
    reply_tos: {reply_to: (real_name: chararray, address: chararray)}
}
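The STORE statement that fails isn't shown above; a minimal sketch of what it presumably looks like (the output path and any options are hypothetical, not taken from the actual script) is:

```pig
-- Hypothetical reconstruction: store the emails relation with the new
-- built-in TrevniStorage from this patch. The output path is made up.
STORE emails INTO '/tmp/emails.trevni' USING TrevniStorage();
```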
Error:
2013-02-17 18:03:31,574 [Thread-6] INFO org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
2013-02-17 18:03:31,680 [Thread-6] INFO org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
2013-02-17 18:03:31,680 [Thread-6] INFO org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
2013-02-17 18:03:31,699 [Thread-6] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2013-02-17 18:03:31,713 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map - Aliases being processed per job phase (AliasName[line,offset]): M: emails[2,9],null[-1,-1],null[-1,-1],token_records[-1,-1],doc_word_totals[5,18],1-84[5,27] C: doc_word_totals[5,18],1-84[5,27] R: doc_word_totals[5,18]
2013-02-17 18:03:31,748 [Thread-6] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(org.apache.pig.builtin.LuceneTokenize)[bag] - scope-19 Operator Key: scope-19) children: null at []]: java.lang.NullPointerException
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:370)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPreCombinerLocalRearrange.getNext(POPreCombinerLocalRearrange.java:126)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:242)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:263)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.runPipeline(POSplit.java:254)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:236)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:228)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NullPointerException
    at org.apache.lucene.analysis.standard.std31.StandardTokenizerImpl31.zzRefill(StandardTokenizerImpl31.java:795)
    at org.apache.lucene.analysis.standard.std31.StandardTokenizerImpl31.getNextToken(StandardTokenizerImpl31.java:1002)
    at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:180)
    at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:50)
    at org.apache.pig.builtin.LuceneTokenize.exec(LuceneTokenize.java:70)
    at org.apache.pig.builtin.LuceneTokenize.exec(LuceneTokenize.java:51)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:380)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:341)
    ... 18 more
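Note on the trace: the NullPointerException surfaces inside Lucene's StandardTokenizer via org.apache.pig.builtin.LuceneTokenize.exec, which suggests a null chararray (most likely a missing body) is being passed to the tokenizer. As a hypothetical workaround until the UDF guards against nulls, filtering null fields before tokenizing should sidestep the crash (the alias names here are guesses based on the log; the actual script isn't shown in this comment):

```pig
-- Hypothetical workaround sketch: keep only records with a non-null body
-- before passing it to LuceneTokenize.
emails_with_body = FILTER emails BY body IS NOT NULL;
token_records    = FOREACH emails_with_body
                   GENERATE message_id, LuceneTokenize(body) AS tokens;
```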
2013-02-17 18:03:31,811 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0001
2013-02-17 18:03:31,811 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases 1-84,doc_word_totals,emails,token_records
2013-02-17 18:03:31,811 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: emails[2,9],null[-1,-1],null[-1,-1],token_records[-1,-1],doc_word_totals[5,18],1-84[5,27] C: doc_word_totals[5,18],1-84[5,27] R: doc_word_totals[5,18]
2013-02-17 18:03:31,813 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-02-17 18:03:31,817 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2013-02-17 18:03:31,817 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local_0001 has failed! Stop running all dependent jobs
2013-02-17 18:03:31,817 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-02-17 18:03:31,818 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2013-02-17 18:03:31,818 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete
2013-02-17 18:03:31,819 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion  PigVersion       UserId   StartedAt            FinishedAt           Features
1.0.3          0.12.0-SNAPSHOT  rjurney  2013-02-17 18:03:31  2013-02-17 18:03:31  HASH_JOIN,GROUP_BY

Failed!

Failed Jobs:
JobId           Alias                                      Feature               Message                          Outputs
job_local_0001  1-84,doc_word_totals,emails,token_records  MULTI_QUERY,COMBINER  Message: Job failed! Error - NA

Input(s):
Failed to read data from "/me/Data/test_mbox"

Output(s):

Job DAG:
job_local_0001 -> null,
null -> null,null,
null -> null,
null -> null,
null -> null,
null

2013-02-17 18:03:31,819 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
> Rewrite of AvroStorage
> ----------------------
>
> Key: PIG-3015
> URL: https://issues.apache.org/jira/browse/PIG-3015
> Project: Pig
> Issue Type: Improvement
> Components: piggybank
> Reporter: Joseph Adler
> Assignee: Joseph Adler
> Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch,
> PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch,
> PIG-3015-doc.patch, TestInput.java, Test.java
>
>
> The current AvroStorage implementation has a lot of issues: it requires old
> versions of Avro, it copies data much more than needed, and it's verbose and
> complicated. (One pet peeve of mine is that old versions of Avro don't
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the
> new implementation is significantly faster, and the code is a lot simpler.
> Rewriting AvroStorage also enabled me to implement support for Trevni (as
> TrevniStorage).
> I'm opening this ticket to facilitate discussion while I figure out the best
> way to contribute the changes back to Apache.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira