Re: Passwordless accounts via ports!

2016-08-10 Thread O. Hartmann
On Thu, 11 Aug 2016 15:29:03 +1000
Dewayne Geraghty  wrote:

> Olivier,
> I've checked my 10.3Stable systems and they all have '*' as their password,
> which is consistent with /usr/ports/Mk/UIDs.  You might like to check the
> age of the latter.
> Regards, Dewayne.
> PS Both ports and src were built from updated src and ports from 2016-08-09

The system runs a very recent CURRENT, last compiled yesterday. The ports
tree is also up to date and updated on a daily basis, and so are the ports.

Interestingly, the problem shows up only on one box so far, although all
other systems are also CURRENT and updated the very same way.

On another system, only user "bacula" has an empty password, whereas this user
is set correctly with a "*" password on yet another system, on which I
installed bacula months earlier.

I checked the installation of the ports and the password entries they create
again, and all I tested (polkit, bacula, sane) did set the "*" as expected
(I had manually deleted the password entry via vipw beforehand).

I guess this "problem" is due to the fact that I install ports and world on a
daily basis on such systems, so the likelihood of hitting an interim bug is
very high.

Regards,
Oliver
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Passwordless accounts via ports!

2016-08-10 Thread Kurt Jaeger
Hi!

> I just checked the security scanning outputs of FreeBSD and found this
> surprising result:
> 
> [...]
> Checking for passwordless accounts:
> polkitd::565:565::0:0:Polkit Daemon User:/var/empty:/usr/sbin/nologin
> pulse::563:563::0:0:PulseAudio System User:/nonexistent:/usr/sbin/nologin
> saned::194:194::0:0:SANE Scanner Daemon:/nonexistent:/bin/sh
> clamav::106:106::0:0:Clamav Antivirus:/nonexistent:/usr/sbin/nologin
> bacula::910:910::0:0:Bacula Daemon:/var/db/bacula:/usr/sbin/nologin
> [...]
> 
> Obviously, some ports install accounts but do not secure them as there is an
> empty password.
> 
> I consider this not a feature, but a bug.

Indeed, but I can't reproduce it on my hosts. There must be some reason
for this to happen ?

-- 
p...@opsec.eu            +49 171 3101372                4 years to go !


Re: Passwordless accounts via ports!

2016-08-10 Thread Ngie Cooper

> On Aug 10, 2016, at 22:05, O. Hartmann  wrote:
> 
> I just checked the security scanning outputs of FreeBSD and found this
> surprising result:
> 
> [...]
> Checking for passwordless accounts:
> polkitd::565:565::0:0:Polkit Daemon User:/var/empty:/usr/sbin/nologin
> pulse::563:563::0:0:PulseAudio System User:/nonexistent:/usr/sbin/nologin
> saned::194:194::0:0:SANE Scanner Daemon:/nonexistent:/bin/sh
> clamav::106:106::0:0:Clamav Antivirus:/nonexistent:/usr/sbin/nologin
> bacula::910:910::0:0:Bacula Daemon:/var/db/bacula:/usr/sbin/nologin
> [...]
> 
> Obviously, some ports install accounts but do not secure them as there is an
> empty password.
> 
> I consider this not a feature, but a bug.

saned is the only one that might concern me because the login shell isn't 
nologin(1).
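
A minimal sketch of the check discussed here, assuming the records follow the
ten-field master.passwd(5) layout shown in the scan output quoted above. The
function names are illustrative and not part of any FreeBSD tool. It lists
accounts whose password field is empty and marks those whose shell is not
/usr/sbin/nologin:

```python
# Sketch only: parse master.passwd-style records (10 colon-separated fields)
# and report accounts with an empty password field.
NOLOGIN_SHELL = "/usr/sbin/nologin"


def parse_record(line):
    """Split one record and keep the fields this check needs."""
    fields = line.rstrip("\n").split(":")
    if len(fields) != 10:
        raise ValueError("expected 10 fields, got %d" % len(fields))
    return {"name": fields[0], "password": fields[1], "shell": fields[9]}


def passwordless(lines):
    """Return (name, has_login_shell) for every record with an empty password."""
    result = []
    for line in lines:
        rec = parse_record(line)
        if rec["password"] == "":
            result.append((rec["name"], rec["shell"] != NOLOGIN_SHELL))
    return result


# The five records from the security scan quoted in this thread.
SCAN_OUTPUT = [
    "polkitd::565:565::0:0:Polkit Daemon User:/var/empty:/usr/sbin/nologin",
    "pulse::563:563::0:0:PulseAudio System User:/nonexistent:/usr/sbin/nologin",
    "saned::194:194::0:0:SANE Scanner Daemon:/nonexistent:/bin/sh",
    "clamav::106:106::0:0:Clamav Antivirus:/nonexistent:/usr/sbin/nologin",
    "bacula::910:910::0:0:Bacula Daemon:/var/db/bacula:/usr/sbin/nologin",
]

for name, login_shell in passwordless(SCAN_OUTPUT):
    print(name, "login shell" if login_shell else "nologin")
```

With the quoted records, only saned is reported with a login shell, which
matches the concern raised above.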

Cheers,
-Ngie


Passwordless accounts via ports!

2016-08-10 Thread O. Hartmann
I just checked the security scanning outputs of FreeBSD and found this
surprising result:

[...]
Checking for passwordless accounts:
polkitd::565:565::0:0:Polkit Daemon User:/var/empty:/usr/sbin/nologin
pulse::563:563::0:0:PulseAudio System User:/nonexistent:/usr/sbin/nologin
saned::194:194::0:0:SANE Scanner Daemon:/nonexistent:/bin/sh
clamav::106:106::0:0:Clamav Antivirus:/nonexistent:/usr/sbin/nologin
bacula::910:910::0:0:Bacula Daemon:/var/db/bacula:/usr/sbin/nologin
[...]

Obviously, some ports install accounts but do not secure them as there is an
empty password.

I consider this not a feature, but a bug.

Regards,
Oliver


Re: Possible zpool online, resilvering issue

2016-08-10 Thread Ultima
> A new transaction group (TXG) is created at LEAST every
> vfs.zfs.txg.timeout (defaults to 5) seconds.

> If you offline a drive for hours or more, it must have all blocks with a
> 'birth time' newer than the last transaction that was recorded on the
> offlined drive replayed to catch that drive up to the other drives in
> the pool.

> As long as you have enough redundancy, the checksum errors can be
> corrected without concern.

> In the end, the checksum errors can be written off as being caused by
> the bad hardware. After you finish the scrub and everything is OK, do:
> 'zpool clear poolname', and it will reset all of the error and checksum
> counts to 0, so you can track if any more ever show up.

Thanks Allan, can always count on you for crystal clear answers =]. I'm
surprised tho that it would be concluded as bad hardware (assuming you mean
the HDs?). It just seems like it's too much of a coincidence. I always ran
zpool clear each time after the resilver/scrub was completed.


> Perhaps one or more of the drives are running out of Realloc Sectors?
> I had once a case where smartctl showed no issues but zfs scrubbing showed
> a defect, some weeks later smartctl was showing some reallocated sectors
> and one week later the HD was out of spare sectors.

> Have you already tested every single HD for smart issues?


Smartd is set to run a short test three times a week, on Tuesday, Thursday
and Saturday. The extended test is performed weekly on Tuesday, an hour after
the short test. This occurs on all 24 drives. A scrub is performed once per
month on Saturday, an hour after the short test.

  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0

This is the value of Reallocated sectors on all the drives (I think this is
the normal value?). This drive's SMART output looks like the worst of the lot.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  592) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 491) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   072   063   044    Pre-fail  Always       -       20189561
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       188
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   092   085   030    Pre-fail  Always       -       1802626788
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       17457
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       158
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       65537
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   055   045   045    Old_age   Always   In_the_past 45 (Min/Max 34/51)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       157
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       867
194 Temperature_Celsius     0x0022   045   055   000    Old_age   Always       -       45 (0 22 0 0 0)
195 Hardware_ECC_Recovered  0x001a   053   011   000
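
The attribute rows above can also be checked mechanically. The following is a
rough sketch, assuming the usual smartctl column order (ID, name, flag, VALUE,
WORST, THRESH, ...) and the common rule that an attribute counts as failed when
its normalized VALUE (now) or WORST (in the past) is at or below THRESH. The
function name is made up for illustration, and only four of the rows are
copied in:

```python
# Sketch only: flag SMART attributes whose normalized VALUE or WORST has
# reached THRESH. Rows are copied from the smartctl output in this message.
ROWS = """\
  1 Raw_Read_Error_Rate     0x000f   072   063   044    Pre-fail  Always       -       20189561
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   055   045   045    Old_age   Always   In_the_past 45 (Min/Max 34/51)
194 Temperature_Celsius     0x0022   045   055   000    Old_age   Always       -       45 (0 22 0 0 0)
"""


def at_or_below_threshold(text):
    """Return (attribute_name, 'now' or 'in the past') for flagged rows."""
    flagged = []
    for line in text.splitlines():
        parts = line.split()
        # Skip headers and anything that is not an attribute row.
        if len(parts) < 6 or not parts[0].isdigit():
            continue
        name = parts[1]
        value, worst, thresh = (int(p) for p in parts[3:6])
        if value <= thresh:
            flagged.append((name, "now"))
        elif worst <= thresh:
            flagged.append((name, "in the past"))
    return flagged


print(at_or_below_threshold(ROWS))
```

For these rows the only hit is Airflow_Temperature_Cel (WORST 045 against
THRESH 045), which agrees with the In_the_past marker smartctl printed.
Reallocated_Sector_Ct stays well above its threshold.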

Re: kernel panic caused by virtualbox(?)

2016-08-10 Thread Don Lewis
On 10 Aug, Jung-uk Kim wrote:
> On 08/09/16 05:12 AM, Konstantin Belousov wrote:
>> On Mon, Aug 08, 2016 at 04:44:20PM -0700, Don Lewis wrote:
>>> On  8 Aug, Konstantin Belousov wrote:
 On Mon, Aug 08, 2016 at 10:22:44AM -0700, John Baldwin wrote:
> On Thursday, August 04, 2016 05:10:29 PM Don Lewis wrote:
>> Reposted to -current to get some more eyes on this ...
>>
>> I just got a kernel panic when I started up a CentOS 7 VM in virtualbox.
>> The host is:
>>  FreeBSD 12.0-CURRENT #17 r302500 GENERIC amd64
>> The virtualbox version is:
>>  virtualbox-ose-5.0.26
>>  virtualbox-ose-kmod-5.0.26_1
>>
>> The panic message is:
>>
>> panic: Unregistered use of FPU in kernel
>> cpuid = 1
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 
>> 0xfe085a55d030
>> vpanic() at vpanic+0x182/frame 0xfe085a55d0b0
>> kassert_panic() at kassert_panic+0x126/frame 0xfe085a55d120
>> trap() at trap+0x7ae/frame 0xfe085a55d330
>> calltrap() at calltrap+0x8/frame 0xfe085a55d330
>> --- trap 0x16, rip = 0x827dd3a9, rsp = 0xfe085a55d408, rbp = 
>> 0xfe085a55d430 ---
>> g_pLogger() at 0x827dd3a9/frame 0xfe085a55d430
>> g_pLogger() at 0x8274e5c7/frame 0x3
>> KDB: enter: panic
>>
>> Since g_pLogger is a symbol in vboxdrv.ko, it looks like virtualbox is
>> the trigger.
>>
>> There are no symbols for the virtualbox kmods, possibly because I
>> installed them as an upgrade using packages (built with the same source
>> tree version) instead of by using PORTS_MODULES in make.conf, so ports
>> kgdb didn't have anything useful to say about what happened before the
>> trap.
>>
>> This panic is very repeatable.  I just got another one when starting the
>> same VM, but this time the two calls before the trap were
>> null_bug_bypass().  Hmn, that symbol is in nullfs ...
>>
>> I don't see this with a Windows 7 VM.
>>
>> All of the virtualbox kmod files are compiled with -mno-mmx -mno-sse
>> -msoft-float -mno-aes -mno-avx
 Your disassembly listed the fxrstor instruction as failing, or did I
 misremember? This is most likely some context switch code, either
 by the virtual machine or erroneously executed guest code. It is not a
 spontaneous use of FPU, but more likely something different. Can you
 confirm?

 In either case, I do not remember any KBI changes around PCB layout or
 fpu_enter() KPI recently.

>
> I suspect head packages are quite likely built against a "wrong" KBI
> and are too fragile to use for kmods vs compiling from ports. :-/  I would
> try a built-from-ports kmod to see if the panics go away.

 FWIW, I will commit the following change shortly. Since third-party
 modules break the invariant, either due to bugs (ndis wrappers) or
 possibly due to KBI breakage, it is worth having the detection enabled
 for production kernels.
>>>
>>> Interesting ... I tried running virtualbox on recent 10.3-STABLE with a
>>> GENERIC kernel and the guest seemed to operate properly.  Then I enabled
>>> INVARIANTS and got the panic.  I suspect that is why nobody has stumbled
>>> across this before.
>>>
>> This is yet another reason to promote KASSERT to the full panic.
>> I expect that the vbox source lacks fpu_kern_enter() calls around the
>> FPU state restoration.
> 
> Unfortunately, the code is in MI source as it is unnecessary for
> supported OSes (read: FreeBSD is not supported) and it's not easy to
> inject fpu_kern_enter()/fpu_kern_leave() calls there. :-(

It's a headache, but our ports can use patch files for that sort of
thing ...



Re: kernel panic caused by virtualbox(?)

2016-08-10 Thread Jung-uk Kim
On 08/09/16 05:12 AM, Konstantin Belousov wrote:
> On Mon, Aug 08, 2016 at 04:44:20PM -0700, Don Lewis wrote:
>> On  8 Aug, Konstantin Belousov wrote:
>>> On Mon, Aug 08, 2016 at 10:22:44AM -0700, John Baldwin wrote:
 On Thursday, August 04, 2016 05:10:29 PM Don Lewis wrote:
> Reposted to -current to get some more eyes on this ...
>
> I just got a kernel panic when I started up a CentOS 7 VM in virtualbox.
> The host is:
>   FreeBSD 12.0-CURRENT #17 r302500 GENERIC amd64
> The virtualbox version is:
>   virtualbox-ose-5.0.26
>   virtualbox-ose-kmod-5.0.26_1
>
> The panic message is:
>
> panic: Unregistered use of FPU in kernel
> cpuid = 1
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 
> 0xfe085a55d030
> vpanic() at vpanic+0x182/frame 0xfe085a55d0b0
> kassert_panic() at kassert_panic+0x126/frame 0xfe085a55d120
> trap() at trap+0x7ae/frame 0xfe085a55d330
> calltrap() at calltrap+0x8/frame 0xfe085a55d330
> --- trap 0x16, rip = 0x827dd3a9, rsp = 0xfe085a55d408, rbp = 
> 0xfe085a55d430 ---
> g_pLogger() at 0x827dd3a9/frame 0xfe085a55d430
> g_pLogger() at 0x8274e5c7/frame 0x3
> KDB: enter: panic
>
> Since g_pLogger is a symbol in vboxdrv.ko, it looks like virtualbox is
> the trigger.
>
> There are no symbols for the virtualbox kmods, possibly because I
> installed them as an upgrade using packages (built with the same source
> tree version) instead of by using PORTS_MODULES in make.conf, so ports
> kgdb didn't have anything useful to say about what happened before the
> trap.
>
> This panic is very repeatable.  I just got another one when starting the
> same VM, but this time the two calls before the trap were
> null_bug_bypass().  Hmn, that symbol is in nullfs ...
>
> I don't see this with a Windows 7 VM.
>
> All of the virtualbox kmod files are compiled with -mno-mmx -mno-sse
> -msoft-float -mno-aes -mno-avx
>>> Your disassembly listed the fxrstor instruction as failing, or did I
>>> misremember? This is most likely some context switch code, either
>>> by the virtual machine or erroneously executed guest code. It is not a
>>> spontaneous use of FPU, but more likely something different. Can you
>>> confirm?
>>>
>>> In either case, I do not remember any KBI changes around PCB layout or
>>> fpu_enter() KPI recently.
>>>

 I suspect head packages are quite likely built against a "wrong" KBI
 and are too fragile to use for kmods vs compiling from ports. :-/  I would
 try a built-from-ports kmod to see if the panics go away.
>>>
>>> FWIW, I will commit the following change shortly. Since third-party
>>> modules break the invariant, either due to bugs (ndis wrappers) or
>>> possibly due to KBI breakage, it is worth having the detection enabled
>>> for production kernels.
>>
>> Interesting ... I tried running virtualbox on recent 10.3-STABLE with a
>> GENERIC kernel and the guest seemed to operate properly.  Then I enabled
>> INVARIANTS and got the panic.  I suspect that is why nobody has stumbled
>> across this before.
>>
> This is yet another reason to promote KASSERT to the full panic.
> I expect that the vbox source lacks fpu_kern_enter() calls around the
> FPU state restoration.

Unfortunately, the code is in MI source as it is unnecessary for
supported OSes (read: FreeBSD is not supported) and it's not easy to
inject fpu_kern_enter()/fpu_kern_leave() calls there. :-(

Jung-uk Kim





Re: PORTS_MODULES breakage on HEAD

2016-08-10 Thread Bryan Drewery
On 8/7/16 5:44 PM, Don Lewis wrote:
> Adding PORTS_MODULES=emulators/virtualbox-ose-kmod recently broke on
> HEAD.  When I do that I get this failure:
> 
> ===> Ports module emulators/virtualbox-ose-kmod (all)
> cd ${PORTSDIR:-/usr/ports}/emulators/virtualbox-ose-kmod; PATH=/usr/obj/usr/src/tmp/legacy/usr/sbin:/usr/obj/usr/src/tmp/legacy/usr/bin:/usr/obj/usr/src/tmp/legacy/bin:/usr/obj/usr/src/tmp/usr/sbin:/usr/obj/usr/src/tmp/usr/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin  SRC_BASE=/usr/src  OSVERSION=120  WRKDIRPREFIX=/usr/obj/usr/src/sys/ make -B clean all
> ===>  Cleaning for virtualbox-ose-kmod-5.0.26_1
> ===>  License GPLv2 accepted by the user
> ===>  Found saved configuration for virtualbox-ose-kmod-4.3.34
> ===>   virtualbox-ose-kmod-5.0.26_1 depends on file: /usr/local/sbin/pkg - found
> ===> Fetching all distfiles required by virtualbox-ose-kmod-5.0.26_1 for building
> ===>  Extracting for virtualbox-ose-kmod-5.0.26_1
> => SHA256 Checksum OK for VirtualBox-5.0.26.tar.bz2.
> ===>  Patching for virtualbox-ose-kmod-5.0.26_1
> ===>  Applying FreeBSD patches for virtualbox-ose-kmod-5.0.26_1
> ===>   virtualbox-ose-kmod-5.0.26_1 depends on executable: kmk - found
> ===>  Configuring for virtualbox-ose-kmod-5.0.26_1
> Checking for environment: Determined build machine: freebsd.amd64, target machine: freebsd.amd64, OK.
> Checking for kBuild: found, OK.
> Checking for gcc:
>   ** cc -target x86_64-unknown-freebsd12.0 --sysroot (variable CC) not found!
> Check /usr/obj/usr/src/sys/usr/ports/emulators/virtualbox-ose-kmod/work/VirtualBox-5.0.26/configure.log for details
> ===>  Script "configure" failed unexpectedly.
> Please report the problem to v...@freebsd.org [maintainer] and attach the
> "/usr/obj/usr/src/sys//usr/ports/emulators/virtualbox-ose-kmod/work/VirtualBox-5.0.26/config.log"
> 
> 
> It appears that the problem is due to CC being set to:
>   cc -target x86_64-unknown-freebsd12.0 --sysroot
> and the Makefile for the port passes this:
>   --with-gcc="${CC}"
> to configure.  The configure script passes $CC to check_avail, which
> does a -z test on it.
> 
> I think that CC should just be set to "cc" and the rest should get added
> to CFLAGS.  I suspect this got broken by the recent crossbuild changes.
> 


It's a SYSTEM_COMPILER bug.  I'll look into fixing it.

For now you can try passing WITHOUT_SYSTEM_COMPILER=yes as a workaround.
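
To illustrate the diagnosis quoted above (a CC value that carries arguments is
treated as a single program name), here is a small sketch of splitting such a
value into the compiler and its extra arguments. This is only an illustration
in Python; it is not how the port's Makefile or the configure script actually
handles CC, and the function name is hypothetical:

```python
import shlex


def split_compiler(cc_value):
    """Split a CC string into (program, extra_arguments)."""
    words = shlex.split(cc_value)
    if not words:
        raise ValueError("empty CC value")
    return words[0], words[1:]


# The CC value reported in the failing configure run.
cc_reported = "cc -target x86_64-unknown-freebsd12.0 --sysroot"
program, extra = split_compiler(cc_reported)

print(program)  # the part a "does this program exist" check should look at
print(extra)    # the part that would belong in the flags instead
```

A lookup for a program literally named like the whole string fails, while a
lookup for the first word alone would succeed on a system that has cc.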


-- 
Regards,
Bryan Drewery





Re: Mosh regression between 10.x and 11-stable

2016-08-10 Thread john hood
On 8/10/16 4:18 AM, Peter Jeremy wrote:
> I recently updated one of my VPS hosts from 10.3-RELEASE-p5 to 11.0-BETA4
> r303811 and mosh to that host from my Linux laptop stopped working.  All
> I get on the laptop is:
> $ mosh remotehost
> Connection to remotehost closed.
> /usr/bin/mosh: Did not find mosh server startup message.
> 
> I've tried rebuilding mosh (and all dependencies) on the host to no avail.

I'm a mosh maintainer.  mosh 1.2.5 (from ports) and mosh master (just
last night tagged as 1.2.6, alas) work fine for me on my two 11.0-BETA4
systems, one local and one remote.

> This isn't the DSA change that's been discussed elsewhere: I can SSH from my
> laptop to the host without problem.  I can also manually invoke mosh-client
> and mosh-server and it works.  Unfortunately, mosh has no provision for
> debugging.  I've tried hacking the mosh perl script to make it more verbose
> and that shows that:
> 1) the "MOSH CONNECT" message isn't making it out of the local ssh process.

Do you know if the message is getting out of mosh-server?  into sshd?
Do you know if mosh-server is actually running?  (It will log utmp
entries on startup.)

Mosh's debugging/logging isn't very good, but 'mosh-server new -v 2>
logfile' does produce some useful info (mostly logging of network traffic).
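
When looking at captured output by hand, it can help to search for the startup
line directly. A minimal sketch, assuming the line has the form
"MOSH CONNECT <port> <key>" as the error message in this thread suggests. The
regular expression is deliberately loose, the helper name is hypothetical, and
the sample strings are invented rather than taken from a real session:

```python
import re

# Assumed line format: "MOSH CONNECT <udp port> <session key>".
CONNECT_RE = re.compile(r"^MOSH CONNECT (\d+) (\S+)\s*$", re.MULTILINE)


def find_startup_message(ssh_output):
    """Return (port, key) if the startup line is present, else None."""
    match = CONNECT_RE.search(ssh_output)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


# Invented samples, for illustration only.
sample_ok = "\r\nMOSH CONNECT 60001 AAAAAAAAAAAAAAAAAAAAAA\r\n"
sample_bad = "Connection to remotehost closed.\r\n"

print(find_startup_message(sample_ok))
print(find_startup_message(sample_bad))
```

Running the same search on output captured at each hop (mosh-server, sshd,
local ssh) would show where the line is lost.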

> 2) it's racy because I can get it from "always fails" to "sometimes works".

How do you get it there?

> My suspicion is that something has changed in either sshd or TCP that
> is resulting in the connection going away before the stdout from the
> remote mosh-server makes it out from the local ssh process.

mosh does 'ssh -t' and uses ptys.  That's another potential point the
message could get dropped.

> I've looked at tcpdump's of both successful and failed SSH sessions
> but don't see anything obviously different (encryption makes it
> difficult to decode the session).
> 
> Has anyone else seen this behaviour or have any ideas what might be
> causing it?

Common suspects include issues with shell login/invocation of mosh (are
you making sure it's reachable in /usr/local/bin with $PATH or
'--server=/usr/local/bin/mosh'?  are your login shell and its login
scripts unusual?)

On Linux we've had issues with ecryptfs and systemd breaking mosh-server
when the ssh session ends, but I don't think that applies here.

regards,

  --jh



Re: Possible zpool online, resilvering issue

2016-08-10 Thread olli hauer
On 2016-08-04 07:22, Ultima wrote:
> Hello,
> 
> I recently had some issue with a PSU and ran several scrubs on a pool with
> around 35T. Random drives would drop and require a zpool online, this found
> checksum errors. (as expected) However, after all the scrubs I ran, I think
> I may have found a bug with zpool online resilvering process.
> 
> 24 disks total, 4 vdevs raidz2 (6 drives each).
> 
> Before this next part... I had a backup PSU, however it was also going bad
> and waiting for RMA. The current one seemed to be dying but ran fine with
> fewer drives. So I decided I would run the server short 4 drives.
> 
> I started by offlining 4 drives from different vdevs (or they were already
> removed from the PSU), then ran a scrub to verify everything. Many checksum
> errors were present on some of the drives, but this was expected due to the
> faulty PSU. Then I offlined 4 different drives, onlined the other 4 and
> scrubbed once again. After the resilver, again, many checksum errors on
> these drives, as expected.
> 
> After the scrub completed, I decided to offline 4 different drives, then
> online the ones that were out of pool for a while. During the resilver,
> checksum errors were once again found. I was surprised due to the recent
> scrub, So I decided to run another scrub, and it found even more checksum
> errors on these recently onlined drives. I didn't think much about it,
> however after the replacement PSU arrived, I onlined all the drives out of
> pool and again, resilver had checksum errors as well as another scrub with
> more sum errors.
> 
> Is this issue known? Is it common for a scrub to be required after onlining
> a disk that was out of pool for some time?
> 
> The drives are ST4000NM0033, and until recently have never had a single
> checksum error in their lifetime (at least with ZFS).
> FreeBSD S1 12.0-CURRENT FreeBSD 12.0-CURRENT #19 r303224: Sat Jul 23
> 10:41:12 EDT 2016
> root@S1:/usr/src/head/obj/usr/src/head/src/sys/MYKERNEL-NODEBUG
>  amd64
> 
> 
> Sorry for the wall of text, but I hope this helps in tracking down this
> possible bug.
> 

Perhaps one or more of the drives are running out of Realloc Sectors?
I once had a case where smartctl showed no issues but zfs scrubbing showed
a defect; some weeks later smartctl was showing some reallocated sectors
and one week later the HD was out of spare sectors.

Have you already tested every single HD for smart issues?

-- 
olli


Re: Possible zpool online, resilvering issue

2016-08-10 Thread Stefan Esser
Am 10.08.2016 um 18:53 schrieb Ultima:
> Hello,
> 
>> I didn't see any reply on the list, so I thought I might let you know
> 
> Sorry, never received this reply (till now) xD
> 
>> what I assume is happening:
> 
>> ZFS never updates data in place, which affects inode updates, e.g. if
>> a file has been read and access times must be updated. (For that reason,
>> many ZFS file systems are configured to ignore access time updates).
> 
>> Even if there were only R/O accesses to files in the pool, there will
>> have been updates to the inodes, which were missed by the offlined
>> drives (unless you ignore atime updates).
> 
>> But even if there are no access time updates, ZFS might have written
>> new uberblocks and other meta information. Check the POOL history and
>> see if there were any TXGs created during the scrub.
> 
>> If you scrub the pool while it is off-line, it should stay stable
>> (but if any information about the scrub, the offlining of drives etc.
>> is recorded in the pool's history log, differences are to be expected).
> 
>> Just my $.02 ...
> 
>> Regards, STefan
> 
> Thanks for the reply, I'm not completely sure what would be considered a
> TXG. Maintained normal operations during most of this noise and this pool
> has quite a bit of activity during normal operations. My zpool history
> looks like it goes on forever and the last scrub is showing it repaired
> 9.48G. That was for all these access time updates? I guess that would be
> a little less than 2.5G per disk worth.
> 
> The zpool history looks like it goes on forever (733373 lines). This pool
> has much of this activity with poudriere. All the entries I see are
> clone, destroy, rollback and snapshotting. I can't really say how much
> but at least 500 (prob much more than that) entries between the last two
> scrubs. Atime is off on all datasets.
> 
>  So to be clear, this is expected behavior with atime=off + TXGs during
> offline time? I had thought that the resilver after onlining the disk
> would bring that disk up-to-date with the pool. I guess my understanding
> was a bit off.

Sorry, you'll have to ask somebody more familiar with ZFS internals
than me.

I just wanted to point out that a scrub might change the state of the
drives, even though no file data is modified.

Some 10 GB "repaired" on a 35000 GB pool is not much, it is about what
I'd expect to be required for meta-data.
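
For scale, the figures mentioned in this exchange work out as follows. This is
plain arithmetic on the numbers quoted in the thread; treating "around 35T" as
35000 GB and dividing by four disks are simplifying assumptions taken from the
messages, not measurements:

```python
repaired_gb = 9.48      # amount the last scrub reported as repaired
pool_gb = 35000.0       # "around 35T", taken as 35000 GB as in the reply above
resilvered_disks = 4    # drives brought back online in one round

fraction_percent = repaired_gb / pool_gb * 100
per_disk_gb = repaired_gb / resilvered_disks

print("%.3f%% of the pool" % fraction_percent)  # 0.027% of the pool
print("%.2f GB per disk" % per_disk_gb)         # 2.37 GB per disk
```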

BTW: The pool history is chronologically sorted, you need only check
the last few lines (written after the start time of the scrub, or
rather written after offlining some of the disk drives).
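
A rough sketch of that filtering step, assuming each `zpool history` entry
starts with a timestamp of the form YYYY-MM-DD.HH:MM:SS followed by the
command. The helper name, pool name, dataset names and sample lines are all
invented for illustration:

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d.%H:%M:%S"


def entries_since(history_lines, cutoff):
    """Keep (timestamp, command) pairs recorded at or after the cutoff."""
    selected = []
    for line in history_lines:
        stamp, _, command = line.partition(" ")
        try:
            when = datetime.strptime(stamp, STAMP_FORMAT)
        except ValueError:
            continue  # header line or anything without a leading timestamp
        if when >= cutoff:
            selected.append((when, command))
    return selected


# Invented sample; "tank", "da3" and the dataset names are placeholders.
sample = [
    "History for 'tank':",
    "2016-08-03.23:10:05 zfs snapshot tank/data@a",
    "2016-08-04.01:00:00 zpool offline tank da3",
    "2016-08-04.01:00:07 zpool scrub tank",
    "2016-08-04.09:15:42 zfs destroy tank/data@a",
]
cutoff = datetime(2016, 8, 4, 1, 0, 0)
recent = entries_since(sample, cutoff)

for when, command in recent:
    print(when, command)
```

Since the log is chronological, reading from the end and stopping at the first
entry older than the cutoff would avoid scanning all 733373 lines.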

Regards, STefan


Re: Possible zpool online, resilvering issue

2016-08-10 Thread Allan Jude
On 2016-08-10 12:53, Ultima wrote:
> Hello,
> 
>> I didn't see any reply on the list, so I thought I might let you know
> 
> Sorry, never received this reply (till now) xD
> 
>> what I assume is happening:
> 
>> ZFS never updates data in place, which affects inode updates, e.g. if
>> a file has been read and access times must be updated. (For that reason,
>> many ZFS file systems are configured to ignore access time updates).
> 
>> Even if there were only R/O accesses to files in the pool, there will
>> have been updates to the inodes, which were missed by the offlined
>> drives (unless you ignore atime updates).
> 
>> But even if there are no access time updates, ZFS might have written
>> new uberblocks and other meta information. Check the POOL history and
>> see if there were any TXGs created during the scrub.
> 
>> If you scrub the pool while it is off-line, it should stay stable
>> (but if any information about the scrub, the offlining of drives etc.
>> is recorded in the pool's history log, differences are to be expected).
> 
>> Just my $.02 ...
> 
>> Regards, STefan
> 
> Thanks for the reply, I'm not completely sure what would be considered a
> TXG. Maintained normal operations during most of this noise and this pool has
> quite a bit of activity during normal operations. My zpool history looks
> like it goes on forever and the last scrub is showing it repaired 9.48G.
> That was for all these access time updates? I guess that would be a little
> less than 2.5G per disk worth.
> 
> The zpool history looks like it goes on forever (733373 lines). This pool
> has much of this activity with poudriere. All the entries I see are clone,
> destroy, rollback and snapshotting. I can't really say how much but at
> least 500 (prob much more than that) entries between the last two scrubs.
> Atime is off on all datasets.
> 
>  So to be clear, this is expected behavior with atime=off + TXGs during
> offline time? I had thought that the resilver after onlining the disk would
> bring that disk up-to-date with the pool. I guess my understanding was a
> bit off.
> 
> Ultima

A new transaction group (TXG) is created at LEAST every
vfs.zfs.txg.timeout (defaults to 5) seconds.

If you offline a drive for hours or more, it must have all blocks with a
'birth time' newer than the last transaction that was recorded on the
offlined drive replayed to catch that drive up to the other drives in
the pool.
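
To put a rough number on that, here is a minimal Python sketch. It takes the
statement above at face value (one new TXG at least every vfs.zfs.txg.timeout
seconds) and so only gives a lower bound. The function name and the 3-hour
example are made up for illustration; they are not from this thread.

```python
import math


def min_txgs_while_offline(offline_seconds, txg_timeout=5.0):
    """Lower bound on the TXGs created while a drive was offline.

    Assumes one TXG at least every `txg_timeout` seconds, matching the
    default vfs.zfs.txg.timeout of 5 mentioned above. A busy pool can
    create TXGs more often, so the real count may be higher.
    """
    if offline_seconds < 0 or txg_timeout <= 0:
        raise ValueError("offline_seconds must be >= 0 and txg_timeout > 0")
    return math.floor(offline_seconds / txg_timeout)


# Hypothetical example: a drive that was offline for 3 hours.
print(min_txgs_while_offline(3 * 3600))  # 2160
```

So even a few hours offline leaves the drive thousands of transaction groups
behind the rest of the pool.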

As long as you have enough redundancy, the checksum errors can be
corrected without concern.

In the end, the checksum errors can be written off as being caused by
the bad hardware. After you finish the scrub and everything is OK, do:
'zpool clear poolname', and it will reset all of the error and checksum
counts to 0, so you can track if any more ever show up.


-- 
Allan Jude





Re: Possible zpool online, resilvering issue

2016-08-10 Thread Ultima
Hello,

> I didn't see any reply on the list, so I thought I might let you know

Sorry, never received this reply (till now) xD

> what I assume is happening:

> ZFS never updates data in place, which affects inode updates, e.g. if
> a file has been read and access times must be updated. (For that reason,
> many ZFS file systems are configured to ignore access time updates).

> Even if there were only R/O accesses to files in the pool, there will
> have been updates to the inodes, which were missed by the offlined
> drives (unless you ignore atime updates).

> But even if there are no access time updates, ZFS might have written
> new uberblocks and other meta information. Check the pool history and
> see if there were any TXGs created during the scrub.

> If you scrub the pool while it is off-line, it should stay stable
> (but if any information about the scrub, the offlining of drives etc.
> is recorded in the pool's history log, differences are to be expected).

> Just my $.02 ...

> Regards, Stefan

Thanks for the reply. I'm not completely sure what would be considered a
TXG. I maintained normal operations during most of this noise, and this pool
has quite a bit of activity during normal operations. My zpool history looks
like it goes on forever, and the last scrub shows it repaired 9.48G.
Was that for all these access time updates? I guess that would be a little
less than 2.5G per disk.

The zpool history looks like it goes on forever (733373 lines). Much of this
pool's activity comes from poudriere. All the entries I see are clone,
destroy, rollback and snapshot operations. I can't really say how many, but
there are at least 500 (probably many more) entries between the last two
scrubs. Atime is off on all datasets.
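
Counting those entries by hand is tedious with a history this long. The
Python sketch below counts the commands logged between the last two
"zpool scrub" entries in saved "zpool history" output. It assumes the usual
"YYYY-MM-DD.HH:MM:SS command" line layout. The pool name, dataset names and
sample lines are invented for illustration. It only counts logged commands,
not TXGs, and it would also match a "zpool scrub -s" entry.

```python
def entries_between_last_two_scrubs(history_text):
    """Count history entries logged between the last two scrub commands.

    Returns None if the text contains fewer than two scrub entries.
    """
    commands = []
    for line in history_text.splitlines():
        parts = line.strip().split(" ", 1)
        # Entry lines start with a timestamp such as 2016-08-09.10:00:00;
        # the "History for 'pool':" banner and blank lines are skipped.
        if len(parts) == 2 and parts[0][:4].isdigit():
            commands.append(parts[1])

    scrubs = [i for i, cmd in enumerate(commands)
              if cmd.startswith("zpool scrub")]
    if len(scrubs) < 2:
        return None
    return scrubs[-1] - scrubs[-2] - 1


# Hypothetical sample; real input would come from `zpool history poolname`.
sample = """History for 'tank':
2016-08-01.02:00:00 zpool scrub tank
2016-08-02.09:15:12 zfs snapshot tank/poudriere/jails@clean
2016-08-02.09:15:40 zfs clone tank/poudriere/jails@clean tank/poudriere/build01
2016-08-02.11:02:03 zfs destroy tank/poudriere/build01
2016-08-08.02:00:00 zpool scrub tank
"""
print(entries_between_last_two_scrubs(sample))  # 3
```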

So to be clear, this is expected behavior with atime=off + TXGs during
offline time? I had thought that the resilver after onlining the disk would
bring that disk up-to-date with the pool. I guess my understanding was a
bit off.

Ultima


Re: Signal 12 on make update (or any target in /usrc/src)

2016-08-10 Thread Konstantin Belousov
On Wed, Aug 10, 2016 at 10:49:40AM -0400, Matteo Riondato wrote:
> 
> > On Aug 10, 2016, at 10:41 AM, Konstantin Belousov wrote:
> > On Wed, Aug 10, 2016 at 10:33:23AM -0400, Matteo Riondato wrote:
> >> Hi all,
> >> 
> >> I recently upgraded from a late June (pre 11-branch, as far as I can tell) 
> >> revision to r303771.
> >> 
> >> Now, running “make update” (or buildworld, …) in /usr/src fails with
> >> a signal 12:
> >> 
> >> matteo@triton:/usr/src$ sudo make update
> >> Password:
> >> *** Signal 12
> > 
> > You did not update, I think.  You most likely only updated the kernel,
> > but left the old userspace in place, at least libc.
> 
> That would be surprising, but it may have happened, as I don’t remember
> for certain having run installworld :/
> 
> > Signal 12 is SIGSYS, which means that the program tries to use a syscall
> > not implemented by the kernel.  My guess is that your kernel lacks option
> > COMPAT_FREEBSD10, and the failing syscall is pipe(2).
> 
> Indeed I do not have COMPAT_FREEBSD10, because I believed my previous world 
> revision was >302092, as noted by the entry about pipe(2) in UPDATING.
> 
> Any suggestion on how to fix this?
> Boot the old kernel, add COMPAT_FREEBSD10 to kernel config, and 
> rebuild/install world and kernel perhaps?
> 

If the old kernel works, then this would allow you to recover.

Otherwise, take libc.so.7 from BETA-4 and put it into /lib, taking a backup
of your current libc first.  I suspect this is the easiest route if the old
kernel does not match your world.


Re: Signal 12 on make update (or any target in /usrc/src)

2016-08-10 Thread Matteo Riondato

> On Aug 10, 2016, at 10:41 AM, Konstantin Belousov  wrote:
> On Wed, Aug 10, 2016 at 10:33:23AM -0400, Matteo Riondato wrote:
>> Hi all,
>> 
>> I recently upgraded from a late June (pre 11-branch, as far as I can tell) 
>> revision to r303771.
>> 
>> Now, running “make update” (or buildworld, …) in /usr/src fails with a
>> signal 12:
>> 
>> matteo@triton:/usr/src$ sudo make update
>> Password:
>> *** Signal 12
> 
> You did not update, I think.  You most likely only updated the kernel,
> but left the old userspace in place, at least libc.

That would be surprising, but it may have happened, as I don’t remember for
certain having run installworld :/

> Signal 12 is SIGSYS, which means that the program tries to use a syscall
> not implemented by the kernel.  My guess is that your kernel lacks option
> COMPAT_FREEBSD10, and the failing syscall is pipe(2).

Indeed I do not have COMPAT_FREEBSD10, because I believed my previous world 
revision was >302092, as noted by the entry about pipe(2) in UPDATING.

Any suggestion on how to fix this?
Boot the old kernel, add COMPAT_FREEBSD10 to kernel config, and rebuild/install 
world and kernel perhaps?

Thanks for the help!

Matteo





Re: Signal 12 on make update (or any target in /usrc/src)

2016-08-10 Thread Konstantin Belousov
On Wed, Aug 10, 2016 at 10:33:23AM -0400, Matteo Riondato wrote:
> Hi all,
> 
> I recently upgraded from a late June (pre 11-branch, as far as I can tell) 
> revision to r303771.
> 
> Now, running “make update” (or buildworld, …) in /usr/src fails with a
> signal 12:
> 
> matteo@triton:/usr/src$ sudo make update
> Password:
> *** Signal 12

You did not update, I think.  You most likely only updated the kernel,
but left the old userspace in place, at least libc.

Signal 12 is SIGSYS, which means that the program tries to use a syscall
not implemented by the kernel.  My guess is that your kernel lacks option
COMPAT_FREEBSD10, and the failing syscall is pipe(2).
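
For reference, a small Python sketch of the number-to-name mapping used
above. The table hard-codes the FreeBSD numbering for signals 1-15 rather
than using Python's signal module, because the numbers are platform-specific
(on Linux, for instance, SIGSYS is 31, not 12). The helper name is
illustrative only.

```python
# FreeBSD signal numbers 1-15, as listed in signal(3) on FreeBSD.
FREEBSD_SIGNALS = {
    1: "SIGHUP", 2: "SIGINT", 3: "SIGQUIT", 4: "SIGILL", 5: "SIGTRAP",
    6: "SIGABRT", 7: "SIGEMT", 8: "SIGFPE", 9: "SIGKILL", 10: "SIGBUS",
    11: "SIGSEGV", 12: "SIGSYS", 13: "SIGPIPE", 14: "SIGALRM", 15: "SIGTERM",
}


def describe_signal(number):
    """Return the FreeBSD name for a signal number reported by make(1)."""
    return FREEBSD_SIGNALS.get(number, "unknown signal %d" % number)


print(describe_signal(12))  # SIGSYS
```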


Signal 12 on make update (or any target in /usrc/src)

2016-08-10 Thread Matteo Riondato
Hi all,

I recently upgraded from a late June (pre 11-branch, as far as I can tell) 
revision to r303771.

Now, running “make update” (or buildworld, …) in /usr/src fails with a signal 
12:

matteo@triton:/usr/src$ sudo make update
Password:
*** Signal 12

Stop.
make: stopped in /usr/src
.ERROR_TARGET='update'
.ERROR_META_FILE=''
.MAKE.LEVEL='0'
MAKEFILE=''
.MAKE.MODE='normal'
.CURDIR='/usr/src'
.MAKE='make'
.OBJDIR='/usr/obj/usr/src'
.TARGETS='update'
DESTDIR=''
LD_LIBRARY_PATH=''
MACHINE='amd64'
MACHINE_ARCH='amd64'
MAKEOBJDIRPREFIX='/usr/obj'
MAKESYSPATH='/usr/src/share/mk'
MAKE_VERSION='20160606'
PATH='/sbin:/bin:/usr/sbin:/usr/bin'
SRCTOP='/usr/src'
OBJTOP='/usr/obj/usr/src'

Installing ports using “make install” works.

Relevant (?) section of src.conf:
WITH_CCACHE_BUILD=y
WITH_SYSTEM_COMPILER=y

src-env.conf:
WITH_META_MODE=yes

make.conf:
KERNCONF=TRITON
CPUTYPE?=k8-sse3
SVN_UPDATE=y
COPTFLAGS=-O2 -pipe
MALLOC_PRODUCTION=y

Any hints?

Thanks,
Matteo





Mosh regression between 10.x and 11-stable

2016-08-10 Thread Peter Jeremy
I recently updated one of my VPS hosts from 10.3-RELEASE-p5 to 11.0-BETA4
r303811 and mosh to that host from my Linux laptop stopped working.  All
I get on the laptop is:
$ mosh remotehost
Connection to remotehost closed.
/usr/bin/mosh: Did not find mosh server startup message.

I've tried rebuilding mosh (and all dependencies) on the host to no avail.

This isn't the DSA change that's been discussed elsewhere: I can SSH from my
laptop to the host without problem.  I can also manually invoke mosh-client
and mosh-server and it works.  Unfortunately, mosh has no provision for
debugging.  I've tried hacking the mosh perl script to make it more verbose
and that shows that:
1) the "MOSH CONNECT" message isn't making it out of the local ssh process.
2) it's racy because I can get it from "always fails" to "sometimes works".

My suspicion is that something has changed in either sshd or TCP that
is resulting in the connection going away before the stdout from the
remote mosh-server makes it out from the local ssh process.

I've looked at tcpdump captures of both successful and failed SSH sessions
but don't see anything obviously different (encryption makes it
difficult to decode the session).

Has anyone else seen this behaviour or have any ideas what might be
causing it?

-- 
Peter Jeremy

