Re: task getting stuck

2014-09-24 Thread Ted Yu
Adding a subject.

bq.   at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:599)

Looks like there might be some issue reading the Parquet file.

Cheers

On Wed, Sep 24, 2014 at 9:10 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Hi Ted,

 See my previous reply to Debasish: all region servers are idle, so I don't
 think it's caused by hotspotting.

 Besides, only 6 out of 3000 tasks were stuck, and their inputs are only
 about 80 MB each.

 Jianshi

 On Wed, Sep 24, 2014 at 11:58 PM, Ted Yu yuzhih...@gmail.com wrote:

 I was thinking along the same line.

 Jianshi:
 See
 http://hbase.apache.org/book.html#d0e6369

 On Wed, Sep 24, 2014 at 8:56 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 HBase regionservers need to be balanced... you might have some skewness
 in the row keys and one regionserver is under pressure... try finding that
 key and spreading it out with a random salt.
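
 A minimal sketch of what that salting could look like (the bucket count
 and the one-byte prefix here are illustrative assumptions, not something
 from this thread):

    import org.apache.hadoop.hbase.util.Bytes

    val numBuckets = 16  // assumed: on the order of the number of region servers

    // Prefix each row key with a deterministic salt byte so writes to a hot
    // key range spread across regions instead of hammering a single server.
    def saltedKey(rowKey: Array[Byte]): Array[Byte] = {
      val salt = (((Bytes.hashCode(rowKey) % numBuckets) + numBuckets) % numBuckets).toByte
      Array(salt) ++ rowKey
    }

 The usual cost of this trick is on the read side: a lookup for a logical
 key then has to fan out over all numBuckets salt prefixes.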

 On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hi Ted,

 It converts RDD[Edge] to HBase row keys and columns and inserts them into
 HBase (in batches).

 BTW, I found batched Puts are actually faster than generating HFiles...
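
 For context, that batched-Put shape is roughly the following sketch, using
 the HBase-0.98-era client API. The actual HbaseRDDBatch code isn't shown in
 this thread, so the signature, the Edge type, the toRowKey/toColumns
 helpers, the table name, and the column family are all stand-ins:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.rdd.RDD

    def batchInsertEdges[Edge](edges: RDD[Edge],
                               toRowKey: Edge => Array[Byte],
                               toColumns: Edge => Seq[(Array[Byte], Array[Byte])]): Unit =
      edges.foreachPartition { part =>
        // One HBase connection per partition, with client-side buffering
        // so Puts go out in batches rather than one RPC per row.
        val table = new HTable(HBaseConfiguration.create(), "edges")
        table.setAutoFlush(false, false)
        part.foreach { edge =>
          val put = new Put(toRowKey(edge))
          toColumns(edge).foreach { case (qualifier, value) =>
            put.add(Bytes.toBytes("cf"), qualifier, value)
          }
          table.put(put)
        }
        table.flushCommits()  // send whatever is still buffered
        table.close()
      }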


 Jianshi

 On Wed, Sep 24, 2014 at 11:49 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. at com.paypal.risk.rds.dragon.storage.hbase.HbaseRDDBatch$$anonfun$batchInsertEdges$3.apply(HbaseRDDBatch.scala:179)

 Can you reveal what HbaseRDDBatch.scala does?

 Cheers

 On Wed, Sep 24, 2014 at 8:46 AM, Jianshi Huang 
 jianshi.hu...@gmail.com wrote:

 One of my big Spark programs always gets stuck at 99%, where a few tasks
 never finish.

 I debugged it by printing out thread stacktraces and found there are
 workers stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup.
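
 (One way to get such a dump, as a sketch; jstack <pid> on the worker host
 prints the same thing, and this isn't necessarily how it was captured here:)

    import scala.collection.JavaConverters._

    // Print every live thread's name, state, and stack, similar to jstack.
    def dumpThreads(): Unit =
      for ((thread, frames) <- Thread.getAllStackTraces.asScala) {
        println("\"" + thread.getName + "\"")
        println("   java.lang.Thread.State: " + thread.getState)
        frames.foreach(f => println("        at " + f))
        println()
      }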

 Has anyone had a similar problem? I'm using Spark 1.1.0 built for HDP 2.1.
 The Parquet files are generated by Pig using the latest parquet-pig-bundle,
 v1.6.0rc1.

 From Spark 1.1.0's pom.xml, Spark is using Parquet v1.4.3; will this
 be problematic?

 One weird behavior is that another program that reads and sorts data from
 the same Parquet files works fine. The only difference seems to be that the
 buggy program uses foreachPartition while the working program uses map; see
 the sketch below the next paragraph.
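
 To make that difference concrete, the two shapes are roughly as follows
 (the input path, transform, and insertBatch are hypothetical stand-ins for
 each program's actual logic):

    import org.apache.spark.SparkContext

    def compareShapes(sc: SparkContext,
                      transform: String => String,
                      insertBatch: Iterator[String] => Unit): Unit = {
      val rows = sc.textFile("hdfs:///tmp/example")  // placeholder input

      // Working program's shape: map is a lazy transformation; records
      // stream through as the downstream action pulls them.
      rows.map(transform).count()

      // Buggy program's shape: foreachPartition is an action that drains
      // each partition's iterator inside the task, so a stalled read
      // leaves the task hanging at the end of the stage.
      rows.foreachPartition(insertBatch)
    }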

 Here's the full stacktrace:

 Executor task launch worker-3
java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:257)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:173)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:138)
        at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:683)
        at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:739)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:796)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:599)
        at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:360)
        at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:100)
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
        at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)

Re: task getting stuck

2014-09-24 Thread Debasish Das
Spark SQL reads Parquet files fine... did you follow one of these to
read/write Parquet from Spark?

http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/
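
A minimal Spark 1.1-era read path, for reference (the path and table name
are placeholders):

    // In spark-shell (Spark 1.1), where sc already exists:
    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)

    // Load the Parquet file as a SchemaRDD and run a quick sanity query.
    val edges = sqlContext.parquetFile("hdfs:///tmp/edges.parquet")
    edges.registerTempTable("edges")
    sqlContext.sql("SELECT COUNT(*) FROM edges").collect().foreach(println)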

On Wed, Sep 24, 2014 at 9:29 AM, Ted Yu yuzhih...@gmail.com wrote:

 Adding a subject.

 bq.   at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:599)

 Looks like there might be some issue reading the Parquet file.

 Cheers
