[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13580386#comment-13580386 ]
Russell Jurney commented on PIG-3015:
-------------------------------------

Loading data without going to Piggybank is amazing. However, TrevniStorage fails to store my emails. You can reproduce this data with your own gmail emails (you only need a few) using these instructions: https://github.com/rjurney/Agile_Data_Code/tree/master/ch03

Schema:

grunt> describe emails
emails: {message_id: chararray,
         thread_id: chararray,
         in_reply_to: chararray,
         subject: chararray,
         body: chararray,
         date: chararray,
         from: (real_name: chararray, address: chararray),
         tos: {to: (real_name: chararray, address: chararray)},
         ccs: {cc: (real_name: chararray, address: chararray)},
         bccs: {bcc: (real_name: chararray, address: chararray)},
         reply_tos: {reply_to: (real_name: chararray, address: chararray)}}

Error:

2013-02-17 18:03:31,574 [Thread-6] INFO org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
2013-02-17 18:03:31,680 [Thread-6] INFO org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
2013-02-17 18:03:31,680 [Thread-6] INFO org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
2013-02-17 18:03:31,699 [Thread-6] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
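For context, the store that fails presumably looks roughly like this (a sketch, not the actual ch03 script: the alias and output path are assumptions; only the input path appears in the job log):

```pig
-- Illustrative sketch only. The alias and output path are assumed;
-- the input path is the one reported in the failed job's log.
emails = LOAD '/me/Data/test_mbox' USING AvroStorage();
STORE emails INTO '/tmp/emails.trevni' USING TrevniStorage();
```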
2013-02-17 18:03:31,713 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map - Aliases being processed per job phase (AliasName[line,offset]): M: emails[2,9],null[-1,-1],null[-1,-1],token_records[-1,-1],doc_word_totals[5,18],1-84[5,27] C: doc_word_totals[5,18],1-84[5,27] R: doc_word_totals[5,18]
2013-02-17 18:03:31,748 [Thread-6] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(org.apache.pig.builtin.LuceneTokenize)[bag] - scope-19 Operator Key: scope-19) children: null at []]: java.lang.NullPointerException
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:370)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPreCombinerLocalRearrange.getNext(POPreCombinerLocalRearrange.java:126)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:242)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:263)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.runPipeline(POSplit.java:254)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:236)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:228)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NullPointerException
	at org.apache.lucene.analysis.standard.std31.StandardTokenizerImpl31.zzRefill(StandardTokenizerImpl31.java:795)
	at org.apache.lucene.analysis.standard.std31.StandardTokenizerImpl31.getNextToken(StandardTokenizerImpl31.java:1002)
	at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:180)
	at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
	at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
	at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:50)
	at org.apache.pig.builtin.LuceneTokenize.exec(LuceneTokenize.java:70)
	at org.apache.pig.builtin.LuceneTokenize.exec(LuceneTokenize.java:51)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:380)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:341)
	... 18 more
2013-02-17 18:03:31,811 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0001
2013-02-17 18:03:31,811 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases 1-84,doc_word_totals,emails,token_records
2013-02-17 18:03:31,811 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: emails[2,9],null[-1,-1],null[-1,-1],token_records[-1,-1],doc_word_totals[5,18],1-84[5,27] C: doc_word_totals[5,18],1-84[5,27] R: doc_word_totals[5,18]
2013-02-17 18:03:31,813 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-02-17 18:03:31,817 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2013-02-17 18:03:31,817 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local_0001 has failed! Stop running all dependent jobs
2013-02-17 18:03:31,817 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-02-17 18:03:31,818 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2013-02-17 18:03:31,818 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete
2013-02-17 18:03:31,819 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
1.0.3	0.12.0-SNAPSHOT	rjurney	2013-02-17 18:03:31	2013-02-17 18:03:31	HASH_JOIN,GROUP_BY

Failed!
Failed Jobs:
JobId	Alias	Feature	Message	Outputs
job_local_0001	1-84,doc_word_totals,emails,token_records	MULTI_QUERY,COMBINER	Message: Job failed! Error - NA

Input(s):
Failed to read data from "/me/Data/test_mbox"

Output(s):

Job DAG:
job_local_0001	->	null,
null	->	null,null,
null	->	null,
null	->	null,
null	->	null,
null

2013-02-17 18:03:31,819 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!

> Rewrite of AvroStorage
> ----------------------
>
>                 Key: PIG-3015
>                 URL: https://issues.apache.org/jira/browse/PIG-3015
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Joseph Adler
>            Assignee: Joseph Adler
>         Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-doc.patch, TestInput.java, Test.java
>
> The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage).
> I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira