SQL warehouse dir

2017-02-10 Thread Joseph Naegele

Hi all,

I've read the docs for Spark SQL 2.1.0 but I'm still having issues with the 
warehouse and related details.

I'm not using Hive proper, so my hive-site.xml consists only of:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/mnt/data/spark/metastore_db;create=true</value>
</property>

I've set "spark.sql.warehouse.dir" in my "spark-defaults.conf", however the 
location in my catalog doesn't match:

scala> spark.conf.get("spark.sql.warehouse.dir")
res8: String = file://mnt/data/spark/warehouse

scala> spark.conf.get("hive.metastore.warehouse.dir")
res9: String = file://mnt/data/spark/warehouse

scala> spark.catalog.listDatabases.show(false)
+-------+---------------------+-----------------------------+
|name   |description          |locationUri                  |
+-------+---------------------+-----------------------------+
|default|Default Hive database|file:/home/me/spark-warehouse|
+-------+---------------------+-----------------------------+

I've also tried setting "spark.sql.warehouse.dir" to a valid HDFS path to no 
avail.
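
For reference, here is a minimal sketch of setting the warehouse directory on the 
SparkSession builder itself rather than in spark-defaults.conf (the app name and 
path are illustrative); my understanding is that the setting has to be in place 
before the first SparkSession is created, since the default database location is 
recorded in the metastore at that point:

import org.apache.spark.sql.SparkSession

// Minimal sketch (app name and path are illustrative): set the warehouse
// directory on the builder before the first SparkSession is created.
// Note the three slashes for a local path: file://mnt/... would parse
// "mnt" as a host name rather than as part of the path.
val spark = SparkSession.builder()
  .appName("warehouse-test")
  .config("spark.sql.warehouse.dir", "file:///mnt/data/spark/warehouse")
  .enableHiveSupport()
  .getOrCreate()

spark.conf.get("spark.sql.warehouse.dir")
spark.catalog.listDatabases.show(false)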

My application loads both ORC tables and AVRO files (using spark-avro) from 
HDFS.
When I load a table using spark.sql("select * from orc.`my-table-in-hdfs`"), I 
see WARN ObjectStore: Failed to get database orc, returning NoSuchObjectException.
When I load an AVRO file from HDFS using spark.read.avro(filename), I see WARN 
DataSource: Error while looking for metadata directory.
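
For context, a minimal sketch of the two load paths described above; the paths 
are placeholders, and spark.read.avro is the implicit reader added by the 
spark-avro import:

// Placeholders for the actual HDFS locations used in my application.
import com.databricks.spark.avro._   // adds the .avro method on DataFrameReader

val orcDf  = spark.sql("select * from orc.`/path/to/my-table-in-hdfs`")
val avroDf = spark.read.avro("/path/to/my-file.avro")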

Any ideas as to what I'm doing wrong?

--
Joe Naegele
Grier Forensics
410.220.0968





Spark SQL 1.6.3 ORDER BY and partitions

2017-01-06 Thread Joseph Naegele
I have two separate but similar issues that I've narrowed down to a pretty good 
level of detail. I'm using Spark 1.6.3, particularly Spark SQL.

I'm concerned with a single dataset for now, although the details apply to 
other, larger datasets. I'll call it "table". It's around 160 M records, 
average of 78 bytes each, so about 12 GB uncompressed. It's 2 GB compressed in 
HDFS.

First issue:
The following query works if "table" is comprised of 200 partitions (on disk), 
but fails when "table" is 1200 partitions with the "Total size of serialized 
results of 1031 tasks (6.0 GB) is bigger than spark.driver.maxResultSize (6.0 
GB)" error:

SELECT * FROM orc.`table` ORDER BY field DESC LIMIT 10;

This is possibly related to the TakeOrderedAndProject step in the execution 
plan, because the following queries do not give me problems:

SELECT * FROM orc.`table`;
SELECT * FROM orc.`table` ORDER BY field DESC;
SELECT * FROM orc.`table` LIMIT 10;

All of these have different execution plans.
My "table" has 1200 partitions because I must use a large value for 
spark.sql.shuffle.partitions to handle joins and window functions on much 
larger DataFrames in my application. Too many partitions may be suboptimal, but 
it shouldn't lead to large serialized results, correct?
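
For reference, here is a minimal sketch of how the relevant settings fit together 
(Spark 1.6 API; the 6g value mirrors the error message above, and whether simply 
raising it is the right fix is exactly my question):

// Sketch only: spark.sql.shuffle.partitions is set high for the larger
// joins/window functions elsewhere in the application.
sqlContext.setConf("spark.sql.shuffle.partitions", "1200")

// spark.driver.maxResultSize is a driver setting, so it lives in
// spark-defaults.conf or on spark-submit, e.g.:
//   --conf spark.driver.maxResultSize=6g

val top10 = sqlContext.sql("SELECT * FROM orc.`table` ORDER BY field DESC LIMIT 10")
top10.show()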

Any ideas? I've seen https://issues.apache.org/jira/browse/SPARK-12837, but I 
think my issue is a bit more specific.


Second issue:
The execution differs when I call .cache() and then .count() on the following 
two DataFrames:

A: sqlContext.sql("SELECT * FROM table")
B: sqlContext.sql("SELECT * FROM table ORDER BY field DESC")

Counting the rows of A works as expected. A single Spark job with 2 stages. 
Load from Hadoop, map, aggregate, reduce to a number.

The same can't be said for B, however. The .cache() call spawns a Spark job 
before I even call .count(), loading from HDFS and performing ConvertToSafe and 
Exchange. The .count() call spawns another job, the first task of which appears 
to re-load from HDFS and again perform ConvertToSafe and Exchange, writing 1200 
shuffle partitions. The next stage then proceeds to read the shuffle data 
across only 2 tasks. One of these tasks completes immediately and the other 
runs indefinitely, failing because the partition is too large (the 
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE error).
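
In code, the simplified repro looks roughly like this ("table" and "field" are 
placeholders, as above):

// Simplified repro of the two cases described above (Spark 1.6 API).
val a = sqlContext.sql("SELECT * FROM table")
val b = sqlContext.sql("SELECT * FROM table ORDER BY field DESC")

a.cache()
a.count()   // as expected: one job with two stages

b.cache()   // already spawns a job (scan, ConvertToSafe, Exchange)
b.count()   // spawns another job that re-reads HDFS, writes 1200 shuffle
            // partitions, then reads them back with only 2 tasks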

Does this behavior make sense at all? Obviously it doesn't make sense to sort 
rows if I'm just counting them, but this is a simplified example of a more 
complex application in which caching makes sense. My executors have more than 
enough memory to cache this entire DataFrame.

Thanks for reading

---
Joe Naegele
Grier Forensics






Storage history in web UI

2017-01-03 Thread Joseph Naegele
Hi all,

Is there any way to observe Storage history in Spark, i.e. which RDDs were 
cached and where, etc. after an application completes? It appears the Storage 
tab in the History Server UI is useless.

Thanks
---
Joe Naegele
Grier Forensics






RE: [Spark SQL] Task failed while writing rows

2016-12-19 Thread Joseph Naegele
Thanks Michael, hdfs dfsadmin -report tells me:

Configured Capacity: 7999424823296 (7.28 TB)
Present Capacity: 7997657774971 (7.27 TB)
DFS Remaining: 7959091768187 (7.24 TB)
DFS Used: 38566006784 (35.92 GB)
DFS Used%: 0.48%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (1):

Name: 127.0.0.1:50010 (localhost)
Hostname: XXX.XXX.XXX
Decommission Status : Normal
Configured Capacity: 7999424823296 (7.28 TB)
DFS Used: 38566006784 (35.92 GB)
Non DFS Used: 1767048325 (1.65 GB)
DFS Remaining: 7959091768187 (7.24 TB)
DFS Used%: 0.48%
DFS Remaining%: 99.50%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 17
Last contact: Mon Dec 19 13:00:06 EST 2016

The Hadoop exception occurs because it times out after 60 seconds in a “select” 
call on a java.nio.channels.SocketChannel, while waiting to read from the 
socket. This implies the client writer isn’t writing on the socket as expected, 
but shouldn’t this all be handled by the Hadoop library within Spark?


It looks like a few similar, but rare, cases have been reported before, e.g. 
https://issues.apache.org/jira/browse/HDFS-770 which is *very* old.


If you’re pretty sure Spark couldn’t be responsible for issues at this level 
I’ll stick to the Hadoop mailing list.


Thanks

---

Joe Naegele

Grier Forensics


From: Michael Stratton [mailto:michael.strat...@komodohealth.com] 
Sent: Monday, December 19, 2016 10:00 AM
To: Joseph Naegele <jnaeg...@grierforensics.com>
Cc: user <user@spark.apache.org>
Subject: Re: [Spark SQL] Task failed while writing rows


It seems like an issue w/ Hadoop. What do you get when you run hdfs dfsadmin 
-report?


Anecdotally (and w/o specifics, as it has been a while), I've generally used 
Parquet instead of ORC, as I've hit a bunch of random problems reading and 
writing ORC w/ Spark... but given that ORC performs a lot better w/ Hive, it can 
be a pain.


On Sun, Dec 18, 2016 at 5:49 PM, Joseph Naegele <jnaeg...@grierforensics.com> wrote:

Hi all,

I'm having trouble with a relatively simple Spark SQL job. I'm using Spark 
1.6.3. I have a dataset of around 500M rows (average 128 bytes per record). 
Its current compressed size is around 13 GB, but my problem started when it 
was much smaller, maybe 5 GB. This dataset is generated by performing a query 
on an existing ORC dataset in HDFS, selecting a subset of the existing data 
(i.e. removing duplicates). When I write this dataset to HDFS using ORC I get 
the following exceptions in the driver:

org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.lang.RuntimeException: Failed to commit task
Suppressed: java.lang.IllegalArgumentException: Column has wrong number of 
index entries found: 0 expected: 32

Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. 
Aborting...

This happens multiple times. The executors tell me the following a few times 
before the same exceptions as above:


2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output committer 
class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor 
exception  for block 
BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642

java.io.EOFException: Premature EOF: no length prefix available

at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)

at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)

at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)


My HDFS datanode says:

2016-12-09 02:39:24,783 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE, 
cliID: DFSClient_attempt_201612090102__m_25_0_956624542_193, offset: 0, 
srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a, blockid: 
BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, duration: 
93026972

2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder: 
BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, 
type=LAST_IN_PIPELINE, downstreams=0:[] terminating

2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation  src: 
/127.0.0.1:57790 dst: /127.0.0.1:50010 

java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/127.0.0.1:50010 remote=/127.0.0.1:57790]


It looks like the datanode is receiving the block on multiple ports (threads?) 
and one of the sending connections terminates early.

[Spark SQL] Task failed while writing rows

2016-12-18 Thread Joseph Naegele

Hi all,

I'm having trouble with a relatively simple Spark SQL job. I'm using Spark 
1.6.3. I have a dataset of around 500M rows (average 128 bytes per record). 
Its current compressed size is around 13 GB, but my problem started when it 
was much smaller, maybe 5 GB. This dataset is generated by performing a query 
on an existing ORC dataset in HDFS, selecting a subset of the existing data 
(i.e. removing duplicates). When I write this dataset to HDFS using ORC I get 
the following exceptions in the driver:

org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.lang.RuntimeException: Failed to commit task
Suppressed: java.lang.IllegalArgumentException: Column has wrong number of index entries found: 0 expected: 32
Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...

This happens multiple times. The executors tell me the following a few times 
before the same exceptions as above:

2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)

My HDFS datanode says:

2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE, cliID: DFSClient_attempt_201612090102__m_25_0_956624542_193, offset: 0, srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a, blockid: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, duration: 93026972
2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation  src: /127.0.0.1:57790 dst: /127.0.0.1:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:57790]

It looks like the datanode is receiving the block on multiple ports (threads?) 
and one of the sending connections terminates early.

I was originally running 6 executors with 6 cores and 24 GB RAM each (Total: 36 
cores, 144 GB) and experienced many of these issues, where occasionally my job 
would fail altogether. Lowering the number of cores appears to reduce the 
frequency of these errors; however, I'm now down to 4 executors with 2 cores 
each (Total: 8 cores), which is significantly fewer, and I still see 
approximately 1-3 task failures.

Details:
- Spark 1.6.3 - Standalone
- RDD compression enabled
- HDFS replication disabled
- Everything running on the same host
- Otherwise vanilla configs for Hadoop and Spark
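
For reference, the job that triggers the failures is roughly the following 
(Spark 1.6; the paths are placeholders and the real deduplication query is more 
involved than dropDuplicates):

// Rough sketch of the failing job. The ORC data source requires a
// HiveContext in Spark 1.6; paths are placeholders.
val source  = sqlContext.read.format("orc").load("hdfs:///data/source-table")
val deduped = source.dropDuplicates()                            // stand-in for the real dedup query
deduped.write.format("orc").save("hdfs:///data/deduped-table")   // this write produces the exceptions above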

Does anybody have any ideas or hints? I can't imagine the problem is solely 
related to the number of executor cores.

Thanks,
Joe Naegele


spark nightly builds with Hadoop 2.7

2016-09-09 Thread Joseph Naegele
Hello,

I'm using the Spark nightly build "spark-2.1.0-SNAPSHOT-bin-hadoop2.7" from
http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/ due to
bugs in Spark 2.0.0 (SPARK-16740, SPARK-16802), however I noticed that the
recent builds only come in "-hadoop2.4-without-hive" and "-without-hadoop"
variants. I'm wondering if the "-hadoop2.7" flavor is intentionally no
longer being built and made available, or if this is an error. The
"-hadoop2.7" variant was available until at least 8/24 (two weeks ago).

Thanks!
Joe Naegele

